Abstract

kLog is a logical and relational language for kernel-based learning. It allows
users to specify logical and relational learning problems at a high level in a
declarative way. It builds on simple but powerful concepts: learning from
interpretations, entity/relationship data modeling, logic programming and
deductive databases (Prolog and Datalog), and graph kernels. kLog is a
statistical relational learning system but, unlike other statistical relational
learning models, it does not represent a probability distribution directly. It
is rather a kernel-based approach to learning that employs features derived
from a grounded entity/relationship diagram. These features are derived using a
novel technique called graphicalization: first, relational representations are
transformed into graph-based representations; subsequently, graph kernels are
employed for defining feature spaces. kLog can use numerical and symbolic data
as well as background knowledge in the form of Prolog or Datalog programs (as
in inductive logic programming systems), and several statistical procedures can
be used to fit the model parameters. The kLog framework can, in principle, be
applied to tackle the same range of tasks that has made statistical relational
learning so popular, including classification, regression, multitask learning,
and collective classification.

Keywords: Logical and relational learning, Statistical relational learning,
Kernel methods, Prolog, Deductive databases



1. Introduction 

The field of statistical relational learning (SRL) is populated with a fairly
large number of models and alternative representations, a state of affairs
often referred to as the "SRL alphabet soup" [1, 2]. Even though there are many
differences between these approaches, they typically extend a probabilistic
representation (most often, a graphical model) with a logical or relational one
[3, 4]. The resulting models then define a probability distribution over
possible worlds, which are typically (Herbrand) interpretations assigning a
truth value to every ground fact. In the machine learning literature [5],
interpretations are often used to model relational learning problems because
they naturally represent entities (or objects) as well as the relationships
amongst them. kLog also adopts the learning from interpretations setting.
However, unlike typical statistical relational learning frameworks, kLog does
not employ a probabilistic framework but is rather based on linear modeling in
a kernel-defined feature space. Using linear modeling and learning from
interpretations, we develop a synthetic view of relational learning. In this
view, instances, i.e. interpretations, are sampled identically and
independently from some unknown but fixed distribution. In the supervised
learning setting, they are represented as pairs $z = (x, y)$ of sets of ground
atoms ($x$ being the inputs, $y$ the outputs) and the task is to learn a
function mapping inputs to outputs.

We use the following general approach to construct a statistical model for
supervised learning. First, a feature vector $\phi(x, y)$ is associated with
each interpretation. A potential function based on the linear model
$F(x, y) = w^\top \phi(x, y)$ is then used to "score" the interpretation.
Prediction (or inference) is the process of maximizing $F$ with respect to $y$.
Learning is the process of fitting $w$ to the available data, typically using
some statistically motivated loss function that measures the discrepancy
between the prediction $f(x_i) = \arg\max_y F(x_i, y)$ and the observed output
$y_i$ on the $i$-th training instance. Clearly, inference is needed as a
subroutine during learning. In principle, a conditional distribution on outputs
could be constructed from $F$, e.g. as $P(y|x) = \frac{1}{Z_x} \exp(F(x, y))$
for a properly chosen partition function $Z_x$. In this case, $f(x)$ can be
interpreted as the solution of the maximum-a-posteriori (MAP) inference
problem. In this paper, however, as in other kernel-based approaches to
structured output learning (e.g. [6]), we are not directly interested in
probabilities.
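
As a minimal illustration of scoring and inference with a linear potential, the sketch below enumerates a small set of candidate outputs and returns the one maximizing $F(x, y)$. The feature map `phi`, the sparse weight dictionary `w`, and the toy attribute values are assumptions made for this example only; they are not part of kLog.

```python
def phi(x, y):
    """Hypothetical joint feature map: one indicator per (attribute value, label) pair."""
    return {(value, y): 1.0 for value in x}


def score(w, x, y):
    """Linear potential F(x, y) = w . phi(x, y), with w stored as a sparse dict."""
    return sum(w.get(key, 0.0) * value for key, value in phi(x, y).items())


def predict(w, x, outputs):
    """Inference f(x) = argmax_y F(x, y), here by brute-force enumeration."""
    return max(outputs, key=lambda y: score(w, x, y))


# Toy weights and input (illustrative values only).
w = {("red", True): 1.5, ("square", True): -0.5, ("red", False): 0.2}
x = {"red", "square"}
print(predict(w, x, [False, True]))
```

Brute-force enumeration is only feasible when the output space is tiny; for structured outputs the maximization has to exploit the structure of the problem.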

The above perspective covers a number of commonly used algorithms ranging from
propositional to relational learning. To see this, consider first binary
classification of categorical attribute-value data. In this case, models such
as Naive Bayes, logistic regression, and support vector machines (SVM) can all
be constructed to share exactly the same feature space. Using indicator
functions on attribute values as features, the three models use a hyperplane as
their decision function:
$f(x) = \arg\max_{y \in \{\mathrm{false}, \mathrm{true}\}} w^\top \phi(x, y)$,
where for Naive Bayes the joint probability of $(x, y)$ is proportional to
$\exp(w^\top \phi(x, y))$. The only difference between the three models is
actually in the way $w$ is fitted to data: SVM optimizes a regularized
functional based on the hinge loss function, logistic regression maximizes the
conditional likelihood of outputs given inputs (which can be seen as minimizing
a smoothed version of the SVM hinge loss), and Naive Bayes maximizes the joint
likelihood of inputs and outputs. For these reasons, the last two models are
often cited as an example of a generative-discriminative conjugate pair [7].
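
The remark that the logistic loss is a smoothed version of the hinge loss can be checked numerically. The sketch below assumes the usual margin formulation with labels in {-1, +1}, where the margin is the label times the linear score; this convention is an assumption of the example and is not taken from the text.

```python
import math


def hinge_loss(margin):
    """SVM hinge loss: zero once the margin reaches 1, linear below that."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Negative conditional log-likelihood log(1 + exp(-margin)), computed stably."""
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Both losses decrease with the margin; the logistic loss never reaches zero.
for margin in (-2.0, 0.0, 1.0, 3.0):
    print(margin, hinge_loss(margin), round(logistic_loss(margin), 4))
```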

When moving up to a slightly richer data type like sequences (perhaps the
simplest case of relational data), the three models have well-known extensions:
Naive Bayes extends to hidden Markov models (HMMs), logistic regression extends
to conditional random fields (CRFs) [8], and SVM extends to structured output
SVM for sequences [9, 6]. Note that when HMMs are used in the supervised
learning setting (in applications such as part-of-speech tagging), the
observation $x$ is the input sequence, and the states form the output sequence
$y$ (which is observed in training data). In the simplest version of these
three models, $\phi(x, y)$ contains a feature for every pair of states
(transitions) and a feature for every state-observation pair (emissions).
Again, these models all use the same feature space (see e.g. [8] for a detailed
discussion).
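
The shared feature space for sequences can be sketched as a count vector over transitions and emissions. The function name `sequence_features`, the feature keys, and the toy tagging example are assumptions made for illustration.

```python
from collections import Counter


def sequence_features(x, y):
    """phi(x, y): counts of state transitions and of state-observation (emission) pairs."""
    if len(x) != len(y):
        raise ValueError("observations and states must have the same length")
    feats = Counter()
    for prev, cur in zip(y, y[1:]):
        feats[("trans", prev, cur)] += 1
    for state, obs in zip(y, x):
        feats[("emit", state, obs)] += 1
    return feats


x = ["the", "dog", "barks"]   # observations (input sequence)
y = ["DET", "NOUN", "VERB"]   # states (output sequence)
feats = sequence_features(x, y)
print(feats)
```

An HMM, a linear-chain CRF, and a structured SVM would each attach one weight to every such feature and differ only in how those weights are fitted.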

When moving up to arbitrary relations, the three above models can again be
generalized but many complications arise, and the large number of alternative
models suggested in the literature has given rise to the SRL alphabet soup
mentioned above. Among generative models, one natural extension of HMMs is
stochastic context free grammars [10], which in turn can be extended to
stochastic logic programs [11]. More expressive systems include probabilistic
relational models (PRMs) [12] and Markov logic networks (MLNs) [13], when
trained generatively. Generalizations of SVM for relational structures akin to
context free grammars have also been investigated [6]. Among discriminative
models, CRFs can be extended from linear chains to arbitrary relations [8], for
example in the form of discriminative Markov networks [14] and discriminative
Markov logic networks [13]. The use of SVM-like loss functions has also been
explored in max-margin Markov networks (M^3N) [15]. These models can cope with
relational data by adopting a richer feature space. Table 1 shows the
relationships among some of these approaches. Methods in the same row use
similar loss functions, while methods in the same column can be arranged to
share the same feature space.

Table 1: Relationship among some statistical relational learning models.

Propositional          Sequences    General relations
Naive Bayes            HMM          Generative MLN
Logistic regression    CRF          Discriminative MLN
SVM                    SVM-HMM      M^3N

kLog contributes to this perspective as it is a language for generating a set 
of features starting from a logical and relational learning problem and using 
these features for learning a (linear) statistical model. To realize this, kLog 
essentially describes learning problems at three different levels. The first 
level specifies the logical and relational learning problem. At this level, the 
description consists of a set of signatures describing the structure of the data 
and the data itself. The data at this level are then graphicalized, that is, the
interpretations are transformed into graphs. This leads to the specification of 
a graph learning problem at the second level. Finally, the graphs are turned 
into feature vectors using a graph kernel, which leads to a statistical learning 
problem at the third level. 

It is important to realize that this is a very flexible architecture in which
only the specification language of the first level is fixed; at this level, we
employ an entity/relationship (E/R) model. The second level is then completely
determined by the choice of the graph kernel. In the current implementation of
kLog that we describe in this paper, we employ the neighborhood subgraph
pairwise distance kernel (NSPDK) of [16], but the reader should keep in mind
that other graph kernels can be incorporated. Similarly, for learning the
linear model at level three we mainly experimented with variants of SVM
learners, but again it is important to realize that other learners can be
plugged in. This situation is akin to that for statistical relational learning
representations such as Markov Logic [13], where a Markov Logic program
essentially specifies a set of features that can be used for inference and
learning, and for which a multitude of different inference and learning
algorithms has been devised (see also Section 6 for a discussion of the
relationships between kLog and other SRL systems).

2. kLog Semantics 

2.1. Data model 

kLog builds upon a logical and relational data representation, which is
closely related to the classic E/R data model, a commonly used design tool in
database development (see, e.g., [17]). The main ontological assumption is
that there are objects and relations in the domain but no functions. kLog is 
based on clausal logic, is embedded in Prolog, and inherits several syntactic 
and semantic aspects of Prolog. 

kLog learns from interpretations. An interpretation is a set of ground 
atoms. A ground atom $r(c_1, \ldots, c_n)$ is a relation symbol (or predicate
name) $r$ of arity $n$ followed by an $n$-tuple of constants. Structured terms
are not allowed. We denote by $\mathcal{C}$ the set of constants (objects) in
the domain and by $\mathcal{R}$ the set of relations. An interpretation
contains all the atoms that are
true and all atoms not in the interpretation are assumed to be false. It can 
therefore also be regarded as a set of tuples in a relational database. So, 
in database terminology, one interpretation corresponds to one instance of a 
relational database describing one possible world (e.g. one molecule or one 
image). In logic programming terminology, we are using so-called Herbrand 
interpretations. 

Because of the close similarity between the semantics of kLog and databases, 
we will sometimes borrow expressions from database theory. For example, 
we will use the term column to refer to the set of constants that, in a given 
interpretation, appear as the i-th argument in true ground atoms of a certain 
relation. 
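
The database reading of an interpretation can be sketched by storing ground atoms as tuples in a set. The closed-world test `holds`, the helper `column`, and the toy molecule atoms are hypothetical names and data chosen for this example.

```python
# One interpretation = one set of ground atoms (relation name followed by constants).
interpretation = {
    ("atom", "a1", "c"),
    ("atom", "a2", "o"),
    ("bond", "a1", "a2", "double"),
}


def holds(interpretation, relation, *constants):
    """Closed-world truth test: an atom is true iff it is listed in the interpretation."""
    return (relation, *constants) in interpretation


def column(interpretation, relation, i):
    """Constants appearing as the i-th argument (1-based) in true atoms of `relation`."""
    return {atom[i] for atom in interpretation if atom[0] == relation}


print(holds(interpretation, "bond", "a1", "a2", "double"))
print(sorted(column(interpretation, "atom", 1)))
```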

kLog makes some assumptions about the domain and about acceptable 
interpretations. In particular: 

A.1 (Object types). The set of constants $\mathcal{C}$ is partitioned into an
    enumerable set of entity identifiers $\mathcal{E}$ (or identifiers for
    short) and a set of property values $\mathcal{V}$. Identifiers are
    themselves partitioned into $k$ subsets
    $\mathcal{E}_1, \ldots, \mathcal{E}_k$ called entity-sets.

A.2 (Finiteness). Every interpretation is a finite set of atoms.

A.3 (Typed relations). For every relation, every column consists of either
    property values or identifiers from one particular entity-set. The type
    signature for a relation $r/m \in \mathcal{R}$ is an expression of the form

    $r(\mathrm{name}_1 :: \mathrm{type}_1, \ldots, \mathrm{name}_m :: \mathrm{type}_m)$

    where, for all $j = 1, \ldots, m$,
    $\mathrm{type}_j \in \{\mathcal{E}_1, \ldots, \mathcal{E}_k, \mathcal{V}\}$
    and $\mathrm{name}_j$ is the name of the $j$-th column of $r$. If column
    $j$ does not have type $\mathcal{V}$, then its name can optionally include
    a role field using the syntax name_j@role_j. If unspecified, role_j is set
    to $j$ by default.

    The type signature is closely related to the type and mode declarations
    that are typical in logical and relational learning.

A.4 (Keys). The primary key of every relation consists of the columns whose
    type belongs to $\{\mathcal{E}_1, \ldots, \mathcal{E}_k\}$. The relational
    arity of a relation is the length of its primary key. As a special case,
    relations of zero relational arity are admitted and they must consist of
    at most a single ground atom in any interpretation.^1

A.5 (E-relations and R-relations). For every entity-set $\mathcal{E}_i$ there
    is a distinguished relation $r/m \in \mathcal{R}$ that has relational
    arity 1 and key of type $\mathcal{E}_i$. These distinguished relations are
    called E-relations and describe entity-sets, possibly with $(m - 1)$
    attached properties as satellite data. The remaining $|\mathcal{R}| - k$
    relations are called R-relations or relationships,^2 which may also have
    properties. Thus, primary keys for R-relations are tuples of foreign keys.
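
Assumptions A.3 and A.4 can be made concrete with a small key check. The dictionary encoding of signatures (the string "property" for property columns, any other string for an entity-set), the helper names, and the toy atoms are assumptions of this sketch, not kLog syntax.

```python
# Hypothetical encoding of type signatures: "property" marks a property column,
# any other type name refers to an entity-set, so that column belongs to the key.
SIGNATURES = {
    "atom": ["atom", "property", "property"],   # E-relation: id, element, charge
    "bond": ["atom", "atom", "property"],       # R-relation: two atom ids, bond type
}


def primary_key(relation, args):
    """Project a ground atom's arguments onto its identifier columns (A.4)."""
    types = SIGNATURES[relation]
    if len(types) != len(args):
        raise ValueError(f"arity mismatch for {relation}")
    return tuple(a for a, t in zip(args, types) if t != "property")


def relational_arity(relation):
    """Length of the primary key, i.e. number of identifier columns."""
    return sum(t != "property" for t in SIGNATURES[relation])


def key_violations(atoms):
    """Report pairs of distinct atoms of one relation that share a primary key."""
    first_seen = {}
    violations = []
    for relation, *args in atoms:
        key = (relation, primary_key(relation, args))
        if key in first_seen and first_seen[key] != tuple(args):
            violations.append((relation, first_seen[key], tuple(args)))
        else:
            first_seen.setdefault(key, tuple(args))
    return violations


atoms = [
    ("atom", "a1", "c", 0),
    ("atom", "a2", "o", 0),
    ("bond", "a1", "a2", "double"),
    ("atom", "a1", "n", 0),   # same identifier a1 with different properties
]
print(key_violations(atoms))
```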

While the above assumptions put some limitations on the expressiveness 
of the language, kLog can effectively represent a wide family of interesting 
relational learning problems. The assumptions made are similar to those 
made by contemporary logical and relational learning systems [4, 5, 3].

The main difference between the kLog and the E/R data models is perhaps the
restriction that primary keys must be tuples of identifiers. As will become
clearer in the following, identifiers are just placeholders and are kept
separate from property values so that learning algorithms will not rely
directly on their values to build a decision function.^3

^1 This kind of relation is useful to represent global properties of an
interpretation (see Example 3.1).

^2 Note that the word "relationship" specifically refers to the association
among entities, while "relation" refers to the more general association among
entities and properties.

^3 To make an extreme case illustrating the concept, it makes no sense to
predict whether a patient suffers from a certain disease based on his/her
social security number.

2.2. An Illustration

To illustrate the above framework, consider the E/R diagram shown on the left
side of Figure 1. The E/R diagram can be transformed straightforwardly into
the following type signatures:

shape(shapeid::self, color::property, outline::property)    (1)
in(s1@container::shape, s2@contained::shape)                 (2)

The relation in is an R-relation, while shape is an E-relation. The type self
refers to the fact that the first argument of the relation shape contains the
primary key of the relation itself. Furthermore, all references to the type
shape will refer to that column. The primary key of in consists of the columns
s1 and s2, which play the container and contained roles, respectively.
Symmetric or undirected relationships can be declared by careful use of role
fields. The middle part of Figure 1 contains one possible interpretation using
this E/R diagram or signature. It contains ground atoms such as
shape(0, green, square), where shape is the relation, 0 an identifier, and
green and square two property values.

2.3. Extensional and intensional relations

An interpretation is a set of atoms but, as in deductive databases, these
atoms can be either listed explicitly, or they can be deduced using rules. In
kLog this is realized by distinguishing between extensional and intensional
signatures. In the former case, all atoms have to be listed explicitly; in the
latter, they may be defined through definite clauses as used in Prolog and
Datalog.

A definite clause is an expression of the form
$h \leftarrow b_1, \ldots, b_m$ where $h$ and the $b_i$ are atoms, that is,
expressions $p(t_1, \ldots, t_n)$ where $p$ is a relation of arity $n$ and the
$t_j$ are terms, that is, constants or variables. As in Prolog, we will use
lowercase names for constants and uppercase names for variables. The clause
basically states that whenever the $b_i\theta$ are true for some substitution
$\theta$, $h\theta$ must also be true. Applying a substitution
$\theta = \{V_1/t_1, \ldots, V_k/t_k\}$, where the $V_j$ are variables and the
$t_j$ terms, to an atom $a$ yields the atom $a\theta$ in which all variables
$V_j$ in $a$ are simultaneously replaced by the corresponding terms $t_j$.

In the previous example, we can now add the intensional predicate inside 
with type signature inside(container::shape,contained::shape) using the follow- 
ing definite clauses: 

inside(X,Y) :- in(X,Y). (3) 
inside(X,Y) :- in(X,Z), inside(Z,Y). (4) 

The first clause states that whenever Y is in X, it is also inside X. The second
states that whenever there exists a Z such that Z is in X and Y is inside Z,
then Y is also inside X. Thus these clauses define inside as the transitive
closure of in. In the example of Figure 1, the intensional definition of inside
is equivalent to the following extensional one:

inside(0,2). inside(3,4). inside(1,3). inside(1,2). inside(1,4).
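
The extension of inside can be reproduced with a naive bottom-up fixpoint over clauses (3) and (4). This is only a sketch of the least-model computation on the in facts of the Bongard interpretation; the function name `least_model_inside` is hypothetical, and kLog itself relies on Prolog for this step.

```python
def least_model_inside(in_facts):
    """Naive fixpoint for: inside(X,Y) :- in(X,Y).  inside(X,Y) :- in(X,Z), inside(Z,Y)."""
    inside = set(in_facts)                      # clause (3)
    changed = True
    while changed:
        changed = False
        for (x, z) in in_facts:                 # clause (4): in(X,Z), inside(Z,Y)
            for (z2, y) in list(inside):
                if z == z2 and (x, y) not in inside:
                    inside.add((x, y))
                    changed = True
    return inside


in_facts = {(0, 2), (3, 4), (1, 3), (1, 2), (1, 4)}
print(sorted(least_model_inside(in_facts)))
```

In this interpretation the only derivable pair, (1, 4) via shape 3, is already an in fact, so the closure adds nothing new; on a chain such as in(1,2), in(2,3) the pair (1,3) would be added.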

This set of atoms will be computed by kLog using standard techniques of
logic programming and deductive databases. Formally speaking, the least
Herbrand model of the knowledge base will be computed.^4

In kLog it will be assumed that clauses are range-restricted.^5 Note also
that it is assumed that the type signatures of predicates used in clauses match
with one another. Furthermore, as kLog is built on top of Prolog it will allow
the user to take advantage of many of the extensions of definite clause logic
that are built into Prolog.

The ability to specify intensional predicates through clauses is most useful
for introducing background knowledge in the learning process. As explained in
Section 4.2, features for the learning process are derived from a graph whose
vertices are ground facts in the database; hence the ability of declaring
rules that specify relations directly translates into the ability of designing
and maintaining features in a declarative fashion. This is one key
characteristic of kLog and, in our opinion, one of the key reasons behind the
success of related systems like Markov logic.

^4 The implementation of kLog in Yap-Prolog does employ the Prolog execution
mechanism extended with tabling to compute the extension of a predicate.
Tabling is a reasoning technique that stores or "memoizes" answers to queries
and guarantees termination [18].

^5 A predicate is range-restricted if each variable occurring in the head of a
clause also occurs in the body of the same clause; this condition ensures that
all derived facts are ground.



2.4. Statistical setting

When learning from interpretations, we assume that interpretations are 
sampled identically and independently from a fixed and unknown distribution 
D. We denote by $\{z_i ; i \in \mathcal{I}\}$ the resulting data set, where $\mathcal{I}$ is a given index
set (e.g. the first n natural numbers) that can be thought of as interpretation
identifiers. Like in other statistical learning systems, the goal is to use a 
data sample to make some form of (direct or indirect) inference about this 
distribution. For the sake of simplicity, throughout this paper we will mainly 
focus on supervised learning. 

In the case of supervised learning, it is customary to think of data as 
consisting of two separate portions: inputs (called predictors or independent 
variables in statistics) and outputs (called responses or dependent variables). 
In our framework, this distinction is reflected in the set of ground atoms in a
given interpretation. That is, $z_i$ is partitioned into two sets: $x_i$ (input
ground atoms) and $y_i$ (output ground atoms).

2.5. Supervised learning jobs 

A supervised learning job in kLog is a set of relations. We begin by defining
the semantics of a job consisting of a single relation. Without loss of
generality, let us assume that this relation has signature

$r(\mathrm{name}_1 :: \mathcal{E}_{i(1)}, \ldots, \mathrm{name}_n :: \mathcal{E}_{i(n)}, \mathrm{name}_{n+1} :: \mathcal{V}, \ldots, \mathrm{name}_{n+m} :: \mathcal{V})$    (5)

with $i(j) \in \{1, \ldots, k\}$ for $j = 1, \ldots, n$. Conventionally, if
$n = 0$ there are no identifiers and if $m = 0$ there are no properties. Note
that because of Assumption A.4, if $n > 0$ and $m > 0$ then $r$ represents a
function^6 with domain
$\mathcal{E}_{i(1)} \times \cdots \times \mathcal{E}_{i(n)}$ and range
$\mathcal{V}^m$. If $m = 0$ then $r$ can be seen as a function with Boolean
range.

Having specified a target relation $r$, kLog is able to infer the partition
$x \cup y$ of ground atoms into inputs and outputs in the supervised learning
setting. The output $y$ consists of all ground atoms of $r$ and all ground
atoms of any intensional relation which depends on $r$. The partition is
inferred by analyzing the dependency graphs of Prolog predicates defining
intensional relations, using an algorithm reminiscent of the call graph
computation in ViPReSS [19].

^6 However, function symbols and complex terms are not allowed in kLog.
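
The input/output partition can be illustrated with a plain reachability computation over a predicate dependency graph. This is a simplified sketch and not the ViPReSS-style algorithm mentioned above; the dictionary `depends_on`, mapping each intensional relation to the relations used in its clause bodies, and the relation names in it are hypothetical.

```python
def output_relations(target, depends_on):
    """Relations whose atoms belong to y: the target plus every intensional
    relation that depends on it, directly or through other relations."""
    outputs = {target}
    changed = True
    while changed:
        changed = False
        for head, body in depends_on.items():
            if head not in outputs and outputs & set(body):
                outputs.add(head)
                changed = True
    return outputs


# Hypothetical background knowledge: head relation -> relations in its clause bodies.
depends_on = {
    "inside": {"in", "inside"},
    "deeply_nested": {"inside"},
    "same_color": {"shape"},
}
print(sorted(output_relations("in", depends_on)))
```

With in as the target, atoms of inside and deeply_nested would be treated as outputs, while same_color, which only depends on shape, stays among the inputs.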

We assume that the training data is a set of complete interpretations.^7
During prediction, we are given a partial interpretation consisting of ground
atoms $x$, and are required to complete the interpretation by predicting output
ground atoms $y$. For the purpose of prediction accuracy estimation, we will
be only interested in the ground atoms of the target relation (a subset of $y$).

Several situations may arise depending on the relational arity $n$ and the
number of properties $m$ in the target relation $r$, as summarized in Table 2.
When $n = 0$ (see Assumption A.4), the declared job consists of predicting one
or more properties of an entire interpretation, when $n = 1$ one or more
properties of certain entities, when $n = 2$ one or more properties of pairs of
entities, and so on. When $m = 0$ (no properties) we have a binary
classification task (where positive cases are ground atoms that belong to the
complete interpretation). Multiclass classification can be properly declared by
using $m = 1$ with a categorical property, which ensures mutual exclusiveness
of classes. Regression is also declared when $m = 1$ but in this case the
property is numeric. Note that property types (numerical vs. categorical) are
automatically inferred by kLog by inspecting the given training data (raising
an exception if incompatible types are mixed). An interesting scenario occurs
when $m > 1$ so that two or more properties are to be predicted at the same
time. A similar situation occurs when the learning job consists of several
target relations. kLog recognizes that such a declaration defines a multitask
learning job. However, having recognized a multitask job does not necessarily
mean that kLog will have to use a multitask learning algorithm capable of
taking advantage of correlations between tasks (e.g. [20]). This is because, by
design and in line with the principles of declarative languages, kLog separates
"what" a learning job looks like from "how" it is solved by applying a
particular learning algorithm. We believe that the separation of concerns at
this level permits greater flexibility and extensibility and facilitates
plugging in alternative learning algorithms (a.k.a. kLog models) that are able
to provide a solution to a given job.



^7 kLog could be extended to deal with missing data by removing the closed world
assumption and requiring some specification of all false groundings.
Table 2: Single relation jobs in kLog.

                     Relational arity n
# of properties m    0                           1                           2

0                    Binary classification       Binary classification       Link prediction
                     of interpretations          of entities

1                    Multiclass / regression     Multiclass / regression     Attributed link
                     on interpretations          on entities                 prediction

>1                   Multitask on                Multitask predictions       Multitask attributed
                     interpretations             on entities                 link prediction
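
The case analysis summarized in Table 2 can be written as a small lookup on the relational arity n and the number of properties m. The function `describe_job` and its `numeric` flag are hypothetical names for this sketch; kLog infers the property type from the training data rather than from a flag.

```python
def describe_job(n, m, numeric=False):
    """Map relational arity n and number of properties m to a (task, scope) pair."""
    if n < 0 or m < 0:
        raise ValueError("n and m must be non-negative")
    if m == 0:
        task = "binary classification"
    elif m == 1:
        task = "regression" if numeric else "multiclass classification"
    else:
        task = "multitask"
    scopes = {0: "interpretations", 1: "entities", 2: "pairs of entities"}
    scope = scopes.get(n, f"{n}-tuples of entities")
    return task, scope


print(describe_job(0, 0))                 # e.g. the zero-arity relation active
print(describe_job(1, 1))                 # e.g. a categorical property of an entity
print(describe_job(0, 1, numeric=True))   # e.g. activity(act::property)
print(describe_job(2, 0))                 # link prediction
```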



3. Examples 

The kLog setting encompasses a relatively large ensemble of machine 
learning scenarios, as detailed in the following examples, which we order 
according to growing complexity of the underlying learning task. 

Example 3.1 (Classification of independent interpretations). This is the
simplest supervised learning problem with structured input data and scalar
(unstructured) output. For the sake of concreteness, let us consider the
problem of small molecule classification as pioneered in the relational
learning setting in [21]. This domain is naturally modeled in kLog as follows.
Each molecule corresponds to one interpretation; there is one E-relation, atom,
that may include properties such as element and charge; there is one
relationship of relational arity 2, bond, that may include a bond_type property
to distinguish among single, double, and resonant chemical bonds; there is
finally a zero-arity relationship, active, distinguishing between positive and
negative interpretations. A concrete example is given in Section 5.

Example 3.2 (Regression and multitask learning). The above example about small
molecules can be extended to the case of regression where the task is to
predict a real-valued property associated with a molecule, such as its
biological activity or its octanol/water partition coefficient (logP) [22].
Many problems in quantitative structure-activity relationships (QSAR) are
actually formulated in this way. The case of regression can be handled simply
by introducing a target relation with signature activity(act::property).

If there are two or more properties to be predicted, one possibility is to
declare several target relations, e.g. we might add logP(logp::property).
Alternatively we may introduce a target relation such as:

target_properties(activity::property, logp::property).

Multitask learning can be handled trivially by learning independent
predictors; alternatively, more sophisticated algorithms that take into account
correlations amongst tasks (such as [23]) could be used.

Example 3.3 (Entity classification). A more complex scenario is the collective
classification of several entities within a given interpretation. We illustrate
this case using the classic WebKB domain [24]. The data set consists of Web
pages from four Computer Science departments and thus there are four
interpretations: Cornell, Texas, Washington, and Wisconsin. In this domain
there are two E-relations: page (for webpages) and link (for hypertextual
links). Text in each page is represented as a bag-of-words (using the
R-relation has) and hyperlinks are modeled by the R-relations link_to and
link_from. Text associated with hyperlink anchors is represented by the
R-relation has_anchor.

The goal is to classify each Web page. There are different data modeling
alternatives for setting up this classification task. One possibility is to
introduce several unary R-relations associated with the different classes, such
as course, faculty, project, and student. The second possibility is to add a
property to the entity-set page, called category, and taking values on the
different possible categories. It may seem that in the latter case we are just
reifying the R-relations describing categories. However, there is an additional
subtle but important difference: in the first modeling approach it is perfectly
legal to have an interpretation where a page belongs simultaneously to
different categories. This becomes illegal in the second approach since
otherwise there would be two or more atoms of the E-relation page with the same
identifier.

From a statistical point of view, since pages for the same department are part
of the same interpretation and connected by hyperlinks, the corresponding
category labels are interdependent random variables and we formally have an
instance of a supervised structured output problem [25] (that in this case
might also be referred to as collective classification [?]). There are however
studies in the literature that consider pages to be independent (e.g. [26]).
Example 3.4 (One-interpretation domains). We illustrate this case on the
Internet Movie Database (IMDb) data set. Following the setup in [27], the
problem is to predict "blockbuster" movies, i.e. movies that will earn more
than $2 million in their opening weekend. The entity-sets in this domain are
movie, studio, and individual. Relationships include
acted_in(actor::individual, m::movie), produced(s::studio, m::movie), and
directed(director::individual, m::movie). The target unary relation
blockbuster(m::movie) collects positive cases. Training in this case uses a
partial interpretation (typically movies produced before a given year). When
predicting the class of future movies, data about past movies' receipts can be
used to construct features (indeed, the count of blockbuster movies produced by
the same studio is one of the most informative features [28]).

A similar scenario occurs for protein function prediction. Assuming data for
just a single organism is available, there is one entity set (protein) and a
binary relation interact expressing protein-protein interaction [29, 30].

Example 3.5 (Link prediction). In this case the goal is to predict whether two
(or possibly more) entities are related. One example is the entity-resolution
problem studied in [31], where one goal is to predict, for every pair of
strings representing author names, whether the two strings refer to the same
person.

4. Graphicalization and feature generation 

The goal is to map an interpretation $z = (x, y)$ into a feature vector
$\phi(z) = \phi(x, y) \in \mathcal{F}$. This enables the application of several
supervised learning algorithms that construct linear functions in the feature
space $\mathcal{F}$. In this context, $\phi(z)$ can be either computed
explicitly or defined implicitly, via a kernel function
$K(z, z') = \langle \phi(z), \phi(z') \rangle$. Kernel-based solutions are very
popular, sometimes allow faster computation, and allow infinite-dimensional
feature spaces. On the other hand, explicit feature map construction may offer
advantages in our setting, in particular when dealing with large scale learning
problems (many interpretations) and structured output tasks (exponentially many
possible predictions). Our framework is based on two steps: first an
interpretation $z$ is mapped into an undirected labeled graph $G_z$; then a
feature vector $\phi(z)$ is extracted from $G_z$. Alternatively, a kernel
function on pairs of graphs $K(z, z') = K(G_z, G_{z'})$ could be computed. The
corresponding potential function is then defined directly as
$F(z) = w^\top \phi(z)$ or as a kernel expansion $F(z) = \sum_i c_i K(z, z_i)$.
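
The two forms of the potential function agree when the kernel is the inner product of explicit feature vectors and the weight vector is the corresponding combination of training feature vectors. The sketch below checks this numerically; the feature vectors and the coefficients `c` are made-up values, and the choice $w = \sum_i c_i \phi(z_i)$ is an assumption of the example.

```python
def dot(u, v):
    """Inner product of two dense vectors."""
    return sum(a * b for a, b in zip(u, v))


# Made-up explicit feature vectors phi(z_i) for two training interpretations.
phi_train = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
c = [0.5, 1.0]                      # expansion coefficients c_i (illustrative)
phi_z = [2.0, 1.0, 3.0]             # feature vector of a new interpretation z

# Explicit form: w = sum_i c_i phi(z_i), then F(z) = w . phi(z).
w = [sum(ci * f[j] for ci, f in zip(c, phi_train)) for j in range(len(phi_z))]
explicit = dot(w, phi_z)

# Kernel expansion: F(z) = sum_i c_i K(z, z_i) with K(z, z') = <phi(z), phi(z')>.
expansion = sum(ci * dot(f, phi_z) for ci, f in zip(c, phi_train))

print(explicit, expansion)
```

The explicit form needs only one inner product per prediction, which is one reason explicit feature maps can be attractive when there are many interpretations.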
There are several motivations that justify the intermediate graphicalization
step. First, and perhaps most importantly, graphicalization is a novel
technique that is related to propositionalization, a well-known technique in
logical and relational learning [5] to transform a relational representation
into a propositional one. The motivation for propositionalization is that one
transforms a rich and structured (relational) representation into a simpler
and flat representation in order to be able to apply a learning algorithm
that works with the simple representation afterwards. To the best of the
authors' knowledge, current propositionalization techniques typically
transform graph-based or relational data into an attribute-value learning
format, or possibly into a multi-instance learning one,^8 but not into a
graph-based one. The graphicalization approach that we introduce does not
transform the data into an attribute-value form but rather into a graph-based
format. This enables us to use the results on kernels for graph-based data, and
in this way upgrades these kernels to a full relational representation. Second,
there is an extensive literature on graph kernels and virtually all existing
solutions can be plugged into the learning from interpretations setting with
minimal effort. This includes implementation issues but also the ability to
reuse existing theoretical analyses. Third, it is notationally simpler to
describe a kernel and feature vectors defined on graphs than to describe the
equivalent counterpart using the Datalog notation.

The graph kernel choice implicitly determines how predicates' attributes 
are combined into features. 
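
As an illustration of the two ways of writing the potential function, the
following Python sketch checks on toy data that the explicit form
$F(z) = w^\top \phi(z)$ and the kernel expansion $F(z) = \sum_i c_i K(z, z_i)$
agree when $w = \sum_i c_i \phi(z_i)$ and $K$ is the inner product of feature
vectors. The feature vectors, coefficients, and function names are
hypothetical and are not part of kLog.

```python
def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def potential_explicit(w, phi_z):
    """F(z) = w^T phi(z), using an explicit feature vector."""
    return dot(w, phi_z)


def potential_kernel_expansion(coeffs, train_phis, phi_z):
    """F(z) = sum_i c_i K(z, z_i), with K(z, z') = <phi(z), phi(z')>."""
    return sum(c * dot(phi_i, phi_z) for c, phi_i in zip(coeffs, train_phis))


def weights_from_expansion(coeffs, train_phis):
    """Recover w = sum_i c_i phi(z_i) from the expansion coefficients."""
    dim = len(train_phis[0])
    return [sum(c * phi_i[k] for c, phi_i in zip(coeffs, train_phis))
            for k in range(dim)]


# Toy data (hypothetical): three training interpretations already mapped
# to explicit feature vectors, plus one new interpretation.
train_phis = [[1, 0, 2], [0, 1, 1], [3, 1, 0]]
coeffs = [2, -1, 1]
phi_new = [1, 2, 3]

w = weights_from_expansion(coeffs, train_phis)
f_explicit = potential_explicit(w, phi_new)
f_kernel = potential_kernel_expansion(coeffs, train_phis, phi_new)
```

The explicit form needs only `w` at prediction time, which is one reason why
explicit feature maps can pay off when there are many interpretations.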

4.1. Graphicalization procedure

Given an interpretation $z$, we construct a bipartite graph $G_z([V_z, F_z], E_z)$
as follows (see Appendix C for notational conventions and Figure 1 for an
example).

Vertices: there is a vertex in $V_z$ for every ground atom of every $E$-relation,
and there is a vertex in $F_z$ for every ground atom of every $R$-relation.
Vertices are labeled by the predicate name of the ground atom, followed
by the list of property values. Identifiers in a ground atom do not
appear in the labels but they uniquely identify vertices. The tuple
ids($v$) denotes the identifiers in the ground atom mapped into vertex $v$.

Edges: $uv \in E_z$ if and only if $u \in V_z$, $v \in F_z$, and ids($u$) $\subset$ ids($v$).
The edge $uv$ is labeled by the role under which the identifier of $u$
appears in $v$ (see Section 2.1).

Footnote: In multi-instance learning [32] the examples are sets of
attribute-value tuples or sets of feature vectors.

shape(4,red,square).
shape(3,red,triangle).
shape(2,red,square).
shape(1,blue,square).
shape(0,green,square).

in(0,2).
in(3,4).
in(1,3).
in(1,2).
in(1,4).

Figure 1: Graphicalization in the Bongard domain. Left: E/R diagram. Center: one
interpretation $z$ (the ground facts listed above). Right: the corresponding $G_z$.



Note that, because of A.5, for every vertex $v \in F_z$, the degree of $v$ equals the
relational arity of the matching $R$-relation. The graphicalization process can
be nicely interpreted as the unfolding of an E/R diagram over the data, i.e.
the E/R diagram is a template that is expanded according to the given ground
atoms (see Figure 1). There are several other examples in the literature
where a graph template is expanded into a ground graph, including the plate
notation in graphical models [33], encoding networks in neural networks for
learning data structures [34], and the construction of Markov networks in
Markov logic [13]. The semantics of kLog graphs for the learning procedure
is however quite different and intimately related to the concept of graph
kernels, as detailed in the following section.
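
The vertex and edge rules of the graphicalization procedure can be sketched
in a few lines of Python. This is only an illustration on the Bongard
interpretation of Figure 1, not the kLog implementation: the function name,
the data layout, and the role names `inside` and `outside` are assumptions,
and identifiers are assumed to be unique across entity sets.

```python
def graphicalize(entity_atoms, relationship_atoms):
    """Build the bipartite graph G_z from ground atoms.

    entity_atoms:       list of (predicate, identifier, properties)
    relationship_atoms: list of (predicate, roles, properties), where
                        roles maps a role name to an entity identifier.
    Returns (labels, edges): labels maps a vertex key to its label, and
    edges is a list of (entity_key, relationship_key, role) triples.
    """
    labels = {}
    entity_key = {}
    for pred, ident, props in entity_atoms:
        key = ("E", pred, ident)
        entity_key[ident] = key
        # Identifiers identify the vertex but are left out of its label.
        labels[key] = (pred,) + tuple(props)

    edges = []
    for pred, roles, props in relationship_atoms:
        key = ("R", pred, tuple(sorted(roles.items())))
        labels[key] = (pred,) + tuple(props)
        # One edge per identifier of the relationship atom, labeled by role.
        for role, ident in sorted(roles.items()):
            edges.append((entity_key[ident], key, role))
    return labels, edges


# The interpretation of Figure 1; the role names are hypothetical.
shapes = [("shape", 4, ("red", "square")),
          ("shape", 3, ("red", "triangle")),
          ("shape", 2, ("red", "square")),
          ("shape", 1, ("blue", "square")),
          ("shape", 0, ("green", "square"))]
pairs = [(0, 2), (3, 4), (1, 3), (1, 2), (1, 4)]
ins = [("in", {"inside": a, "outside": b}, ()) for a, b in pairs]

labels, edges = graphicalize(shapes, ins)
```

In the resulting graph every `in` vertex has degree 2, the relational arity
of the relationship, while the entity with identifier 1 has degree 3 because
it takes part in three relationship atoms.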



4.2. Graph Kernel

Learning in kLog is performed using a suitable graph kernel on the graphicalized
interpretations. While in principle any graph kernel can be employed,
there are several requirements that the chosen kernel has to meet in practice.
On the one hand the kernel has to allow fast computations, especially
with respect to the graph size, as the grounding phase in the graphicalization
procedure can yield very large graphs. On the other hand we need a
general-purpose kernel with a flexible bias, so as to adapt to a wide variety of
application scenarios.

In the current implementation we use an extension of a recently introduced
fast graph kernel [16] called Neighborhood Subgraph Pairwise Distance
Kernel (NSPDK). While the original kernel is suitable for sparse graphs
with discrete vertex and edge labels, here we propose an extension to deal
with a larger class of graphs whose labels are tuples of mixed discrete and
numerical types. In the following sections we introduce the notation and give
a formal definition of the original as well as the enhanced graph kernel.

4.2.1. Kernel Definition and Notation

The NSPDK is an instance of a decomposition kernel, where "parts" are
pairs of subgraphs (for more details on decomposition kernels see Appendix D).
For a given graph $G = (V, E)$ and an integer $r \ge 0$, let $N_r^v(G)$ denote
the subgraph of $G$ rooted in $v$ and induced by the set of vertices

$$V_r^v = \{x \in V : d^\star(x, v) \le r\}, \qquad (6)$$

where $d^\star(x, v)$ is the shortest-path distance between $x$ and $v$. A neighborhood
is therefore a topological ball with center $v$. Let us also introduce the
following neighborhood-pair relation:

$$R_{r,d} = \{(N_r^v(G), N_r^u(G), G) : d^\star(u, v) = d\}, \qquad (7)$$

that is, relation $R_{r,d}$ identifies pairs of neighborhoods of radius $r$ whose roots
are exactly at distance $d$. We define $\kappa_{r,d}$ over graph pairs as the decomposition
kernel on the relation $R_{r,d}$, that is:

$$\kappa_{r,d}(G, G') = \sum_{\substack{A, B \in R_{r,d}^{-1}(G) \\ A', B' \in R_{r,d}^{-1}(G')}} \kappa((A, B), (A', B')), \qquad (8)$$

where $R_{r,d}^{-1}(G)$ indicates the multiset of all pairs of neighborhoods of radius
$r$ with roots at distance $d$ that exist in $G$.

We can now obtain a flexible parametric family of kernel functions by
specializing the kernel $\kappa$. The general structure of $\kappa$ is:

$$\kappa((A, B), (A', B')) = \kappa_{\mathrm{root}}((A, B), (A', B')) \, \kappa_{\mathrm{subgraph}}((A, B), (A', B')). \qquad (9)$$



Footnote: Conventionally $d^\star(x, v) = \infty$ if no path exists between $x$ and $v$.

In the following we assume

$$\kappa_{\mathrm{root}}((A, B), (A', B')) = \mathbf{1}_{\ell(r(A)) = \ell(r(A'))} \, \mathbf{1}_{\ell(r(B)) = \ell(r(B'))} \qquad (10)$$

where $\mathbf{1}$ denotes the indicator function, $r(A)$ is the root of $A$ and $\ell(v)$ the label
of vertex $v$. The role of $\kappa_{\mathrm{root}}$ is to ensure that only neighborhoods centered
on the same type of vertex will be compared. Assuming a valid kernel for
$\kappa_{\mathrm{subgraph}}$ (in the following sections we give details on concrete instantiations),
we can finally define the NSPDK as:

$$K(G, G') = \sum_r \sum_d \kappa_{r,d}(G, G'). \qquad (11)$$

For efficiency reasons we consider the zero-extension of $K$ obtained by imposing
an upper bound on the radius and the distance parameter:
$K_{r^*, d^*}(G, G') = \sum_{r=0}^{r^*} \sum_{d=0}^{d^*} \kappa_{r,d}(G, G')$, that is, we limit the sum of the
$\kappa_{r,d}$ kernels for all increasing values of the radius (distance) parameter up to
a maximum given value $r^*$ ($d^*$). Furthermore we consider a normalized version
of $\kappa_{r,d}$, that is:
$\hat{\kappa}_{r,d}(G, G') = \kappa_{r,d}(G, G') / \sqrt{\kappa_{r,d}(G, G) \, \kappa_{r,d}(G', G')}$,
to ensure that relations induced by all
values of radii and distances are equally weighted regardless of the size of the
induced part sets.

Finally, it is easy to show that the Neighborhood Subgraph Pairwise 
Distance Kernel is a valid kernel as: 1) it is built as a decomposition kernel 
over the countable space of all pairs of neighborhood subgraphs of graphs of 
finite size; 2) the kernel over parts is a valid kernel; 3) the zero-extension to 
bounded values for the radius and distance parameters preserves the kernel 
property; and 4) so does the normalization step. 
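
To make the decomposition concrete, the sketch below computes the vertex
sets $V_r^v$, enumerates root pairs at distance $d$, and evaluates a
normalized $\kappa_{r,d}$ on small graphs. It is a simplified illustration
under stated assumptions, not the kLog implementation: root pairs are taken
as ordered pairs, and the comparison of parts uses the root label together
with the sorted multiset of labels in each neighborhood as a crude stand-in
for the subgraph kernels discussed in the following sections. All function
names are hypothetical.

```python
from collections import deque
from math import sqrt


def distances_from(adj, source):
    """Shortest-path distances from source by breadth-first search.
    Unreachable vertices are absent (their distance is infinite)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def neighborhood(adj, v, r):
    """Vertex set {x : d(x, v) <= r} that induces the neighborhood of v."""
    return frozenset(x for x, d in distances_from(adj, v).items() if d <= r)


def root_pairs(adj, d):
    """Ordered pairs of roots (v, u) whose distance is exactly d."""
    return [(v, u) for v in sorted(adj)
            for u, duv in sorted(distances_from(adj, v).items()) if duv == d]


def part_signature(adj, lab, v, r):
    """Root label plus sorted neighborhood labels (NOT an isomorphism test)."""
    return (lab[v], tuple(sorted(lab[x] for x in neighborhood(adj, v, r))))


def kappa_rd(g, gp, r, d):
    """Count matching pairs of neighborhood pairs between two graphs."""
    (adj, lab), (adjp, labp) = g, gp
    total = 0
    for v, u in root_pairs(adj, d):
        a, b = part_signature(adj, lab, v, r), part_signature(adj, lab, u, r)
        for vp, up in root_pairs(adjp, d):
            ap = part_signature(adjp, labp, vp, r)
            bp = part_signature(adjp, labp, up, r)
            total += int(a == ap and b == bp)
    return total


def normalized_kappa_rd(g, gp, r, d):
    """kappa_rd(G, G') / sqrt(kappa_rd(G, G) * kappa_rd(G', G'))."""
    denom = sqrt(kappa_rd(g, g, r, d) * kappa_rd(gp, gp, r, d))
    return kappa_rd(g, gp, r, d) / denom if denom > 0 else 0.0


# Two isomorphic labeled paths and one path with different labels.
g = ({"a": ["b"], "b": ["a", "c"], "c": ["b"]}, {"a": "X", "b": "Y", "c": "X"})
g_iso = ({1: [2], 2: [1, 3], 3: [2]}, {1: "X", 2: "Y", 3: "X"})
g_other = ({1: [2], 2: [1, 3], 3: [2]}, {1: "X", 2: "X", 3: "X"})
```

Summing `kappa_rd` (or its normalized version) over `r` and `d` up to the
chosen maxima gives the bounded kernel described above.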

4.3. Subgraph Kernels

The role of $\kappa_{\mathrm{subgraph}}$ is to compare pairs of neighborhood graphs extracted
from two graphs. The application of the graphicalization procedure to diverse
relational datasets can potentially induce graphs with significantly different
characteristics. In some cases (e.g. domains with discrete properties) an exact
matching between neighborhood graphs is appropriate; in other cases
(e.g. domains with continuous properties) it is more appropriate to use a soft
notion of matching.

In the following sections we introduce variants of $\kappa_{\mathrm{subgraph}}$ to be used when
the atoms in the relational dataset can have at most a single discrete or
continuous property, or when more general tuples of properties are allowed.

4.4. Exact Graph Matching

An important case is when the atoms, which are mapped by the graphicalization
procedure to the vertex set of the resulting graph, can have at most
a single discrete property. In this case an atom $r(c)$ becomes a vertex
$v$, whose label is obtained by concatenation of the signature name and the
attribute value.

In this case $\kappa_{\mathrm{subgraph}}$ has the following form:

$$\kappa_{\mathrm{subgraph}}((A, B), (A', B')) = \mathbf{1}_{A \cong A' \wedge B \cong B'} \qquad (12)$$

where $\mathbf{1}$ denotes the indicator function and $\cong$ isomorphism between graphs.
Note that $\mathbf{1}_{A \cong A'}$ is a valid kernel between graphs under the feature map $\phi_{cl}$
that transforms $A$ into $\phi_{cl}(A)$, a sequence of all zeros except the $i$-th element
equal to 1 in correspondence to the identifier for the canonical representation
of $A$.



4.4.1. Graph Invariant



Evaluating the kernel in Equation 12 requires graph isomorphism as a subroutine,
a problem for which it is unknown whether polynomial algorithms
exist. Algorithms that are exponential in the worst case but fast
in practice do exist [35, 36]. For special graph classes, such as bounded-degree
graphs [37], there exist polynomial-time algorithms. However, since it is
hard to limit the type of graph produced by the graphicalization procedure
(e.g. cases with very high vertex degree are possible, as in general an entity
atom may play a role in an arbitrary number of relationship atoms), we prefer
an approximate solution with efficiency guarantees based on topological
distances, similar in spirit to [38].

The key idea is to compute an integer pseudo-identifier for each graph 
such that isomorphic graphs are guaranteed to bear the same number (i.e., 
the function is graph invariant), but non-isomorphic graphs are likely to bear 
a different number. A trivial identity test between the pseudo-identifiers then 
approximates the isomorphism test. The reader less interested in the techni- 
cal details of the kernel may want to skip the remainder of this subsection. 

We obtain the pseudo-identifier by first constructing a graph invariant encoding
$\mathcal{L}^g(G_h)$ for a rooted neighborhood graph $G_h$. Then we apply a hash
function $H(\mathcal{L}^g(G_h)) \to \mathbb{N}$ to the encoding. Note that we cannot hope to
exhibit an efficient certificate for isomorphism in this way, and in general there
can be collisions between two non-isomorphic graphs, either because these
are assigned the same encoding or because the hashing procedure introduces
a collision even when the encodings are different.

Hashing is not a novel idea in machine learning; it is commonly used e.g.
for creating compact representations of molecular structures [39] and has been
advocated as a tool for compressing very high dimensional feature spaces [40].
In the present context hashing is mainly motivated by the computational
efficiency gained by approximating the isomorphism test.

The graph encoding $\mathcal{L}^g(G_h)$ that we propose is best described by introducing
two new label functions: one for the vertices and one for the edges,
denoted $\mathcal{L}^n$ and $\mathcal{L}^e$ respectively. $\mathcal{L}^n(v)$ assigns to vertex $v$ a lexicographically
sorted sequence of pairs composed of a topological distance and a vertex label,
that is, $\mathcal{L}^n(v)$ returns a sorted list of pairs $\langle \mathcal{D}(v, u), \ell(u) \rangle$ for all $u \in G_h$.
Moreover, since $G_h$ is a rooted graph, we can use the knowledge about the
identity of the root vertex $h$ and prepend to the returned list the additional
information of the distance from the root node $\mathcal{D}(v, h)$. The new edge label
is produced by composing the new vertex labels with the original edge label,
that is, $\mathcal{L}^e(uv)$ assigns to edge $uv$ the triplet $\langle \mathcal{L}^n(u), \mathcal{L}^n(v), \ell(uv) \rangle$. Finally
$\mathcal{L}^g(G_h)$ assigns to the rooted graph $G_h$ the lexicographically sorted list of
$\mathcal{L}^e(uv)$ for all $uv \in E(G_h)$. In words: we relabel each vertex with a sequence
that encodes the vertex distance from all other (labeled) vertices (plus the
distance from the root vertex); the graph encoding is obtained as the sorted
edge list, where each edge is annotated with the endpoints' new labels. For
a proof that $\mathcal{L}^g(G_h)$ is graph invariant, see [41, p. 53].

We finally resort to a hashing function for variable-length data based on the
Merkle-Damgård construction to map the various lists to integers, that is, we
map the distance-label pairs, the new vertex labels, the new edge labels and
the new edge sequences to integers (in this order). Note that it is trivial to
control the size of the feature space by choosing the hash codomain size (or
alternatively the bit size for the returned hashed values).
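
The encoding and hashing steps of this subsection can be sketched as follows.
This is a simplified illustration rather than the kLog code: the encoding is
kept as a nested tuple and hashed once at the end (instead of hashing at each
level), SHA-256 from the Python standard library stands in for the hash
function, the two endpoint codes of an undirected edge are sorted so that the
result does not depend on edge orientation, and the neighborhood graph is
assumed to be connected and without self-loops. All names are hypothetical.

```python
import hashlib
from collections import deque


def distances_from(adj, source):
    """Shortest-path distances from source by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist


def vertex_encoding(adj, vlabel, root, v):
    """Distance from the root, then the sorted (distance, label) pairs."""
    dist = distances_from(adj, v)
    pairs = tuple(sorted((d, vlabel[u]) for u, d in dist.items()))
    return (dist[root], pairs)


def graph_encoding(adj, vlabel, elabel, root):
    """Sorted edge list; each edge carries its endpoint codes and label.
    elabel maps frozenset({u, v}) to the label of the undirected edge uv."""
    codes = {v: vertex_encoding(adj, vlabel, root, v) for v in adj}
    edges = []
    for edge, lab in elabel.items():
        u, v = tuple(edge)
        ends = tuple(sorted((codes[u], codes[v])))
        edges.append((ends, lab))
    return tuple(sorted(edges))


def pseudo_identifier(adj, vlabel, elabel, root, bits=30):
    """Integer below 2**bits computed from the graph encoding."""
    encoding = graph_encoding(adj, vlabel, elabel, root)
    digest = hashlib.sha256(repr(encoding).encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


# Two isomorphic rooted paths with different vertex identifiers, and a
# third graph that differs in one vertex label.
g1 = ({"a": ["b"], "b": ["a", "c"], "c": ["b"]},
      {"a": "X", "b": "Y", "c": "Z"},
      {frozenset({"a", "b"}): "r", frozenset({"b", "c"}): "s"}, "b")
g2 = ({10: [20], 20: [10, 30], 30: [20]},
      {10: "X", 20: "Y", 30: "Z"},
      {frozenset({10, 20}): "r", frozenset({20, 30}): "s"}, 20)
g3 = ({10: [20], 20: [10, 30], 30: [20]},
      {10: "X", 20: "Y", 30: "X"},
      {frozenset({10, 20}): "r", frozenset({20, 30}): "s"}, 20)
```

The `bits` parameter plays the role of the hash codomain size: fewer bits
give a smaller feature space at the price of more collisions.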

4.5. Soft Matches 

The idea of counting exact neighborhood subgraph matches to express
graph similarity is adequate when the graphs are sparse (that is, when the
edge and the vertex set sizes are of the same order) and when the maximum
vertex degree is low. However, when the graph is not sparse or some vertices
exhibit large degrees, the likelihood that two neighborhoods match exactly
quickly approaches zero, hence the similarity notion becomes degenerate.
In these cases a better solution is to relax the all-or-nothing type of match
and allow for a partial or soft match between subgraphs. Although there
exist several graph kernels that allow this type of match, they generally
suffer from very high computational costs [12]. To ensure efficiency, we use
an idea introduced in the Weighted Decomposition Kernel [43]: given a
subgraph, we consider only the multinomial distribution (i.e. the histogram)
of the labels, discarding all structural information. In the soft match kernel,
the comparison between two pairs of neighborhood subgraphs is replaced by:

$$\kappa_{\mathrm{subgraph}}((A, B), (A', B')) = \sum_{\substack{v \in V(A) \cup V(B) \\ v' \in V(A') \cup V(B')}} \mathbf{1}_{\ell(v) = \ell(v')} \qquad (13)$$

where $V(A)$ is the set of vertices of $A$. In words, we count the vertices
that share the same label in either one or the other of the neighbourhood
subgraphs.

Footnote: Naturally there is a tradeoff between the size of the feature space
and the number of hash collisions.

Footnote: A concrete example is when text information associated to a document
is modeled explicitly, i.e. when word entities are linked to a document entity:
in this case the degree corresponds to the document vocabulary size.

Figure 3: Graph invariant computation: the graph hash is computed as the hash of the
sorted list of edge hashes. An edge hash is computed as the hash of the sequence of the
two endpoints' hashes and the edge label. The endpoint hash is computed as the hash of
the sorted list of distance-vertex label pairs.

Figure 4: Illustration of the soft matching kernel. Only features generated by a selected
pair of vertices are represented: vertices A and B at distance 1 yield a multinomial
distribution of the vertex labels in neighborhoods of radius 1. On the right we compute the
contribution to the kernel value by the represented features.
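
The soft match of Equation 13 reduces to an inner product between label
histograms, as the following sketch shows. It is an illustration under one
stated assumption: the histogram of a pair is obtained by extracting the label
multiset of each neighborhood separately and adding the two, so a vertex lying
in both neighborhoods is counted once per neighborhood. The function names and
the toy labels are hypothetical.

```python
from collections import Counter


def pair_histogram(labels_a, labels_b):
    """Label histogram of a pair of neighborhoods (multisets are added)."""
    return Counter(labels_a) + Counter(labels_b)


def soft_match(pair, pair_prime):
    """Number of (v, v') combinations that carry the same label."""
    hist = pair_histogram(*pair)
    hist_prime = pair_histogram(*pair_prime)
    return sum(count * hist_prime[label] for label, count in hist.items())


# Vertex labels of the neighborhoods (A, B) and (A', B').
pair = (["word", "word", "doc"], ["word", "page"])
pair_prime = (["word", "doc"], ["doc", "query"])
value = soft_match(pair, pair_prime)
```

Because only label counts are compared, two high-degree neighborhoods can
still be similar even when they are far from isomorphic.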

4.6. Tuples of properties 

A standard assumption in graph kernels is that vertex and edge labels are
elements of a discrete domain. However, in kLog the information associated
with vertices is a tuple that can contain both discrete and real values. Here
we extend the NSPDK model to allow both a hard and a soft match type
over graphs with multiple properties of mixed types.

Footnote: Note that the pair of neighborhood subgraphs are considered jointly, i.e. the label
multisets are extracted independently from each subgraph in the pair and then combined
together.

The general structure of the kernel on the subgraph can be written as:

$$\kappa_{\mathrm{subgraph}}((A, B), (A', B')) = \sum_{\substack{v \in V(A) \cup V(B) \\ v' \in V(A') \cup V(B')}} \mathbf{1}_{\ell(v) = \ell(v')} \, \kappa_{\mathrm{tuple}}(v, v') \qquad (14)$$

where, for an atom $r(c_1, c_2, \ldots, c_m)$ mapped into vertex $v$, $\ell(v)$ returns the
signature name $r$. $\kappa_{\mathrm{subgraph}}$ is a kernel that is defined over sets of vertices
(atoms) and can be decomposed into a part that ensures matches between
atoms with the same signature name, and a second part that takes into
account the tuple of property values. In particular, depending on the type of
property values and the type of matching required, we obtain the following
cases.

Soft match for discrete tuples. We consider each element of the tuple
independently:

$$\kappa_{\mathrm{tuple}}(v, v') = \sum_d \mathbf{1}_{\mathrm{prop}_d(v) = \mathrm{prop}_d(v')} \qquad (15)$$

where for an atom $r(c_1, c_2, \ldots, c_d, \ldots, c_m)$ mapped into vertex $v$, $\mathrm{prop}_d(v)$
returns the property value $c_d$.

Hard match for discrete tuples. We replace the label $\ell(v)$ in Equation 14
with the labeling procedure detailed in Section 4.4.1. In this
way each vertex receives a canonical label that uniquely identifies it in the
neighborhood graph. Finally we consider each element of the tuple jointly:

$$\kappa_{\mathrm{tuple}}(v, v') = \prod_d \mathbf{1}_{\mathrm{prop}_d(v) = \mathrm{prop}_d(v')} \qquad (16)$$

where the symbols have the same meaning as in the soft match for discrete
tuples case. In practice this is equivalent to the hard match as detailed in
Section 4.4 where the property value is replaced with the concatenation of
all property values in the tuple.



Footnote: We remind the reader that in the graphicalization procedure we remove the primary
and foreign keys from each atom, hence the only information available at the graph level
are the signature name and the property values.






Soft match for real tuples. To upgrade the soft match kernel to tuples
of real values we replace the exact match with the standard product. The
kernel on the tuple then becomes:

$$\kappa_{\mathrm{tuple}}(v, v') = \sum_c \mathrm{prop}_c(v) \cdot \mathrm{prop}_c(v') \qquad (17)$$

Hard match for real tuples. We proceed in an analogous fashion as for
the hard match for discrete tuples, but we combine the real-valued tuple of
corresponding vertices with the standard product as in Equation 17. That
is, we replace the label $\ell(v)$ in Equation 14 with the labeling procedure
of Section 4.4.1 and we use:

$$\kappa_{\mathrm{tuple}}(v, v') = \sum_c \mathrm{prop}_c(v) \cdot \mathrm{prop}_c(v'). \qquad (18)$$

Soft match for mixed discrete and real tuples. When dealing with
tuples of mixed discrete and real values, the contributions of the kernels on
the separate collections of discrete and real attributes are combined via
summation:

$$\kappa_{\mathrm{tuple}}(v, v') = \sum_d \mathbf{1}_{\mathrm{prop}_d(v) = \mathrm{prop}_d(v')} + \sum_c \mathrm{prop}_c(v) \cdot \mathrm{prop}_c(v') \qquad (19)$$

where indices $d$ and $c$ run exclusively over the discrete and continuous
properties respectively.

Hard match for mixed discrete and real tuples. In an analogous
fashion, provided that $\ell(v)$ in Equation 14 is replaced with the labeling
procedure as detailed in Section 4.4.1, we have:

$$\kappa_{\mathrm{tuple}}(v, v') = \prod_d \mathbf{1}_{\mathrm{prop}_d(v) = \mathrm{prop}_d(v')} + \sum_c \mathrm{prop}_c(v) \cdot \mathrm{prop}_c(v') \qquad (20)$$

In this way each vertex receives a canonical label that uniquely identifies it
in the neighborhood graph. The discrete labels of corresponding vertices are
concatenated and matched for identity, while the real tuples of corresponding
vertices are combined via the standard dot product.
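
The tuple kernels for discrete and real properties are simple enough to state
directly in code. The sketch below covers the soft and hard match on discrete
tuples, the dot product on real tuples, and the soft match on mixed tuples;
the function names and the representation of a vertex as a pair
(discrete properties, real properties) are assumptions made for illustration.

```python
def soft_discrete(props, props_prime):
    """Number of positions at which two discrete tuples agree."""
    return sum(int(a == b) for a, b in zip(props, props_prime))


def hard_discrete(props, props_prime):
    """1 if the two discrete tuples agree at every position, else 0."""
    return int(all(a == b for a, b in zip(props, props_prime)))


def real_dot(values, values_prime):
    """Standard dot product between two real-valued tuples."""
    return sum(a * b for a, b in zip(values, values_prime))


def soft_mixed(vertex, vertex_prime):
    """Sum of the soft discrete match and the real dot product.
    Each vertex is a pair (discrete_properties, real_properties)."""
    (disc, real), (disc_p, real_p) = vertex, vertex_prime
    return soft_discrete(disc, disc_p) + real_dot(real, real_p)


# Two hypothetical vertices with two discrete and two real properties.
v = (("red", "square"), (1.0, 2.0))
v_prime = (("red", "triangle"), (3.0, 4.0))
```

Note that the hard match is all-or-nothing on the whole discrete tuple, which
is equivalent to comparing the concatenation of its values.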



Footnote: Note that this is equivalent to collecting all numerical properties of a vertex's tuple in
a vector and then employing the standard dot product between vectors.

4.6.1. Domain Knowledge Bias via Kernel Points

At times it is convenient, for efficiency reasons or to inject domain knowledge
into the kernel, to be able to explicitly select the neighborhood subgraphs.
We provide a way to do so, declaratively, by introducing the set of
kernel points, a subset of $V(G)$ which includes all vertices associated with
ground atoms of some specially marked signatures. We then redefine the
relation $R_{r,d}(A, B, G)$ used in Equation 8 like in Section 4.2.1 but with the
additional constraint that the roots of $A$ and $B$ be kernel points.

Kernel points are typically vertices that are believed to represent infor- 
mation of high importance for the task at hand. Vertices that are not kernel 
points contribute to the kernel computation only when they occur in the 
neighborhoods of kernel points. In kLog, kernel points are declared as a list 
of domain relations: all vertices that correspond to ground atoms of these 
relations become kernel points. 

4.6.2. Viewpoints and i.i.d. view

The above approach effectively defines a kernel over interpretations

$$K(z, z') = K(G_z, G_{z'})$$

where $G_z$ is the result of graphicalization applied to interpretation $z$. For
learning jobs such as classification or regression on interpretations (see
Table 2), this kernel is directly usable in conjunction with plain kernel
machines like SVM. When moving to more complex jobs involving e.g.
classification of entities or tuples of entities, the kernel induces a feature vector
$\phi(x, y)$ suitable for the application of a structured output technique where
$f(x) = \arg\max_y w^\top \phi(x, y)$. Alternatively, we may convert the structured
output problem into a set of i.i.d. subproblems as follows. For simplicity, assume
the learning job consists of a single relation $r$ of relational arity $n$. We call
each ground atom of $r$ a case. Intuitively, cases correspond to training targets
or prediction-time queries in supervised learning. Usually an interpretation
contains several cases corresponding to specific entities such as individual
Web pages (as in Section 3.3) or movies (as in Section 3.4), or tuples of
entities for link prediction problems (as in Section 3.5). Given a case $c \in y$, the
viewpoint of $c$, $W_c$, is the set of vertices that are adjacent to $c$ in the graph
(see bottom of Figure 6 for an illustration). We then consider the mutilated
graph $G_c$ where all vertices in $y$, except $c$, are removed. We then define a
kernel $\hat{\kappa}$ on mutilated graphs like the NSPDK but with the additional constraint
that the first endpoint must be in $W_c$, using the decomposition

$$\hat{R}_{r,d} = \{(A, B, G_c) : A = N_r^v(G_c), \; B = N_r^u(G_c), \; v \in W_c, \; d^\star(u, v) = d\}.$$

We obtain in this way a kernel "centered" around case $c$:

$$\hat{\kappa}(G_c, G'_{c'}) = \sum_{r,d} \sum_{\substack{A, B \in \hat{R}_{r,d}^{-1}(G_c) \\ A', B' \in \hat{R}_{r,d}^{-1}(G'_{c'})}} \delta(A, A') \, \delta(B, B')$$

and finally we let

$$K(G, G') = \sum_{c \in y, \; c' \in y'} \hat{\kappa}(G_c, G'_{c'}).$$

This kernel corresponds to the potential

$$F(x, y) = w^\top \sum_c \hat{\phi}(x, c)$$

which is clearly maximized by maximizing, independently, all sub-potentials
$w^\top \hat{\phi}(x, c)$ with respect to $c$.
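
The notions of viewpoint and mutilated graph can be illustrated with a small
sketch. The graph below is a hypothetical fragment in the style of the link
prediction example (two students, one professor, two candidate `advised_by`
cases); the function names and vertex names are assumptions and do not come
from kLog.

```python
def viewpoint(adj, c):
    """W_c: the set of vertices adjacent to case c."""
    return set(adj[c])


def mutilated_graph(adj, cases, c):
    """G_c: drop every case vertex except c, together with its edges."""
    removed = set(cases) - {c}
    return {v: [u for u in nbrs if u not in removed]
            for v, nbrs in adj.items() if v not in removed}


# Hypothetical fragment: entity vertices s1, s2, p1 and two case vertices.
adj = {
    "s1": ["ab1"],
    "s2": ["ab2"],
    "p1": ["ab1", "ab2"],
    "ab1": ["s1", "p1"],   # case advised_by(s1, p1)
    "ab2": ["s2", "p1"],   # case advised_by(s2, p1)
}
cases = ["ab1", "ab2"]

w_c = viewpoint(adj, "ab1")
g_c = mutilated_graph(adj, cases, "ab1")
```

Features for case `ab1` would then be generated only from neighborhood pairs
whose first root lies in `w_c`, computed on `g_c`, so the other target atoms
cannot leak into the representation of the case.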

By following this approach we do not obtain a collective prediction
(individual ground atoms are predicted independently). Still, even in this reduced
setting, the kLog framework can be exploited in conjunction with meta-learning
approaches that act as a surrogate for collective prediction. For example, Prolog
predicates in intensional signatures can effectively be used as expressive
relational templates for stacked graphical models [44], where input features for
one instance are computed from predictions on other related instances.
Results in Section 5.2 for the "partial information" setting are obtained using a
special form of stacking.

5. kLog in practice 

In this section we illustrate the use of kLog in a number of application
domains. kLog is currently embedded in Yap Prolog and consists of three
main components: (1) a domain-specific interpreter, (2) a database loader,
and (3) a library of predicates that are used to specify the learning task, to
declare the learning model, and to perform training, prediction, and
performance evaluation. A domain specification is a set of signature declarations
(see Section 2), possibly enriched with intensional and auxiliary predicates.






Figure 5: Results (AUROC, 10-fold cross-validation) on the Bursi data set with (left,
panel "With functional groups") and without (right, panel "Atom bond") functional
groups. Lines on the x axis indicate kernel and SVM hyperparameters (from top to
bottom: maximum radius $r^*$, maximum distance $d^*$, and regularization parameter $C$).

The database loader reads in a file containing extensional ground facts and
generates a graph for each interpretation, according to the procedure outlined
in Section 4. The library includes common utilities for training and testing.
Most of kLog is written in Prolog except feature vector generation and training,
which are written in C++. kLog can interface with several solvers for
parameter learning including LibSVM and SVM stochastic gradient
descent [46]. All experimental results reported in this section were obtained
using LibSVM in binary classification, multiclass, or regression mode, as
appropriate.

5.1. Predicting a single property of one interpretation 

We now expand the ideas outlined in Example 3.1. Predicting the
biological activity of small molecules is a major task in chemoinformatics
and can help drug development [47] and toxicology [48, 49]. Most existing
graph kernels have been tested on data sets of small molecules (see
e.g. [50], among others). From the kLog perspective the data consists of
several interpretations, one for each molecule. In the case of binary
classification (e.g. active vs. nonactive), there is a single target predicate whose
truth state corresponds to the class of the molecule. To evaluate kLog we
used two data sets. The Bursi data set [54] consists of 4,337 molecular
structures with associated mutagenicity labels (2,401 mutagens and 1,936
nonmutagens) obtained from a short-term in vitro assay that detects genetic
damage. The Biodegradability data set [55] contains 328 compounds and the
regression task is to predict their half-life for aerobic aqueous biodegradation
starting from molecular structure and global molecular measurements.

Table 3: Results on biodegradability.

Setting              RMSE           SCC            MAPE
Functional groups    1.07 ± 0.01    0.54 ± 0.01    14.01 ± 0.08
Atom bonds           1.13 ± 0.01    0.48 ± 0.01    14.55 ± 0.12



Appendix B.2 shows a fragment of kLog domain specification for Bursi.
Relevant predicates in the extensional database are a/2, b/3 (atoms and
bonds, respectively, extracted from the chemical structure), sub/3 (functional
groups, computed by DMax Chemistry Assistant [56, 57]), fused/3,
connected/4 (direct connection between two functional groups), linked/4
(connection between functional groups via an aliphatic chain). Aromaticity (used
in the bond-type property of b/3) was also computed by DMax Chemistry
Assistant. The intensional signatures essentially serve the purpose of
simplifying the original data representation. For example atm/2 omits some entities
(hydrogen atoms), and fg_fused/3 replaces a list of atoms by its length. The
target relation mutagenic has relational arity zero since it is a property of
the whole interpretation. As shown in Figure 5, results are relatively stable
with respect to the choice of kernel hyperparameters (maximum radius
and distance) and SVM regularization, and essentially match the best
previously reported results (AUROC 0.92 ± 0.02) even without composition with a
polynomial kernel. These results are not surprising since the graphs generated
by kLog are very similar in this case to the expanded molecular graphs used
in that earlier work.

kLog code for biodegradability is similar but, this being a regression task, we
have a target relation declared as

signature biodegradation(halflife::property)::extensional.

We estimated prediction performance by repeating five times a ten-fold
cross-validation procedure as described in [55] (using exactly the same folds
in each trial). Results, namely root mean squared error (RMSE), squared
correlation coefficient (SCC), and mean absolute percentage error (MAPE), are
reported in Table 3. For comparison, the best RMSE obtained by kFOIL on
this data set (and same folds) is 1.14 ± 0.04 (kFOIL was previously shown to
outperform Tilde and S-CART).
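
For reference, the three regression measures can be computed as in the sketch
below. The formulas are the usual textbook definitions (SCC as the squared
Pearson correlation, MAPE expressed as a percentage); the paper does not spell
them out, so they are stated here as assumptions, and the sample values are
made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def scc(y_true, y_pred):
    """Squared Pearson correlation between targets and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov * cov / (var_t * var_p)


def mape(y_true, y_pred):
    """Mean absolute percentage error (targets must be non-zero)."""
    n = len(y_true)
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / n


# Made-up targets and predictions.
y_true = [2.0, 4.0]
y_pred = [1.0, 5.0]
```

In a repeated cross-validation such as the one described above, each measure
would be computed per trial and then summarized as mean and standard deviation.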

5.2. Link prediction 

We illustrate kLog in the well-known University of Washington domain,
where the task consists of predicting which students are advised by which
professors given background information such as paper authorship, enrollment
data, and courses. The data set was originally created to illustrate feature
declaration for a link prediction task using Markov logic [13]. Features
analogous to those in Markov logic are easy to declare in kLog. For example
a useful feature of a given student-professor pair could be whether they
co-authored a paper, which is obtainable in kLog by declaring a signature
on_same_paper over student-professor pairs
(see code in Appendix B.1). It is similarly easy to count the number of
common papers using the code snippet below:



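signature n_common_papers(s::student,p::professor,n::property)::intensional.
% N is the number of papers co-authored by S and P
% (setof/3 fails, and so does this clause, when there are none).
n_common_papers(S,P,N) :-
    student(S), professor(P),
    setof(Pub, (publication(Pub,S), publication(Pub,P)), CommonPapers),
    length(CommonPapers,N).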



Unlike Markov logic features (which are counts of true ground atoms of
formulae in the given interpretation), kLog features combine deduced facts into
more complex features thanks to the graph kernel. Figure 6 shows a fragment
of the AI group interpretation in the UWCSE domain. Dashed diamonds are
nodes corresponding to the target relation and not used for feature generation.

To assess kLog behavior we evaluated prediction accuracy according to
the leave-one-research-group-out setup of [13], using the script of Appendix
B.1 together with an NSPDK kernel with distance 2, radius 2, and soft match.
Comparative results with respect to Markov logic are reported in Figure 7
(MLN results published in [13]). The whole 5-fold procedure runs in about 20
seconds on a 2.5GHz single core. Compared to MLNs, kLog in the current
implementation has the disadvantage of not performing collective assignment
but the advantage of defining more powerful features thanks to the graph
kernel. Additionally, MLN results use a much larger knowledge base. The
advantage of kLog over MLN in Figure 7 is due to the more powerful feature
space. Indeed, when setting the graph kernel distance and radius to 0
and 1, respectively, the feature space has just one feature for each ground
signature, in close analogy to MLN. In this case, the performance (area under
recall-precision curve, AURPC) of kLog using the same set of signatures
drops dramatically from 0.28 to 0.09. In a second experiment we predicted
the relation advised_by starting from partial information (i.e. when relations
Student (and its complement Professor) are unknown, as in [13]). In this
case we created a pipeline of two predictors. Our procedure is reminiscent of
stacked generalization [59]. In the first stage a leave-one-research-group-out
cross-validation procedure is applied to the training data to obtain predicted
groundings for Student (a binary classification task on entities). Predicted
groundings are then fed to the second stage, which predicts the binary relation
advised_by. The overall procedure is repeated using one research group
at a time for testing. Results are reported in Figure 8.

Figure 6: Simplified UWCSE domain. Top left: E/R diagram; top right: fragment of
the AI interpretation. Bottom: same fragment showing a sample viewpoint ($W_c$) and the
corresponding mutilated graph where removed vertices are grayed out and the target case
$c$ highlighted.

Since kLog is embedded in the programming language Prolog, it is easy
to use the output of one learning task as the input for the next one, as
illustrated in the pipeline. This is because both the inputs and the outputs
are relations. Relations are treated uniformly regardless of whether they are
defined intensionally, extensionally, or are the result of a previous learning
run. Thus kLog satisfies what has been called the closure principle in the
context of inductive databases [60, 61]; it is also this principle, together with
the embedding of kLog inside a programming language (Prolog), that turns
kLog into a true programming language for machine learning [62, 63, 64].
Such programming languages possess, in addition to the usual constructs,
primitives for learning, that is, for specifying the inputs and the outputs of
the learning problems. In this way they support the development of software
in which machine learning is embedded without requiring the developer to
be a machine learning expert. The development of such languages is a
long-standing open research question according to Mitchell [62].
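
The nesting of the two-stage procedure can be made concrete with a small sketch. It is not kLog code: the function names are hypothetical, the five research-group names are assumed for illustration, and only the fold structure is shown (an outer loop that holds out one group for testing, and an inner leave-one-group-out loop over the remaining groups that would produce the predicted Student groundings).

```python
def leave_one_group_out(groups):
    """Yield (held_out, remaining) pairs, one per group."""
    for g in groups:
        yield g, [h for h in groups if h != g]


def two_stage_plan(groups):
    """Map each outer test group to the inner folds run on its training groups.

    Each inner fold (held_out, fit_groups) stands for: fit the first-stage
    predictor on fit_groups and predict Student on held_out. The second
    stage would then be trained on these predicted groundings.
    """
    plan = {}
    for test_group, train_groups in leave_one_group_out(groups):
        plan[test_group] = list(leave_one_group_out(train_groups))
    return plan


# Group names are assumptions made for this illustration.
groups = ["ai", "graphics", "language", "systems", "theory"]
plan = two_stage_plan(groups)
for test_group, folds in plan.items():
    print(test_group, [held_out for held_out, _ in folds])
```

The point of the inner loop is that the second stage never sees first-stage predictions made by a model that was fitted on the same group, and the outer test group is never used in either stage.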



5.3. Entity classification 

The WebKB data set [24] has been widely used to evaluate relational
methods for text categorization. It consists of academic Web pages from
four computer science departments and the task is to identify the category
of each page (such as student page, course page, professor page, etc.). Figure 9 shows
the E/R diagram used in kLog. One of the most important relationships in
this domain is has, which associates words with web pages. After graphicalization,
vertices representing web pages have large degree (at least the number
of words), making the standard NSPDK of [16] totally inadequate: even when
setting the maximum distance D = 1 and the maximum radius R = 2, the
hard match would essentially create a distinct feature for every page. In this
domain we can therefore appreciate the flexibility of the kernel defined in
Section 4.2. In particular, the soft match kernel creates histograms of word
occurrences in the page, which is very similar to the bag-of-words (with counts)
representation that is commonly used in text categorization problems. The
additional signature cs_in_url embodies the common-sense background knowledge
that many course web pages contain the string "cs" followed by some
digits, and is intensionally defined using a Prolog predicate that holds true
when the regular expression :cs(e*)[0-9]+: matches the page URL.

Empirical results using only four universities (Cornell, Texas, Washington,
Wisconsin) in the leave-one-university-out setup are reported in Table 4.

Figure 7: Comparison between kLog and MLN on the UWCSE domain (all information).

Figure 8: Comparison between kLog and MLN on the UWCSE domain (partial information).

Figure 9: WebKB domain.

Table 4: Results on WebKB: contingency table, accuracy (A), precision (P), recall (R), and F1 measure
per class. The last row reports micro-averages.

          research  faculty  course  student     A     P     R    F1
research        59       11       4       15  0.94  0.66  0.70  0.68
faculty          9      125       2       50  0.91  0.67  0.82  0.74
course                                        0.99  1.00  0.95  0.98
student         16       17       5      493  0.90  0.93  0.88  0.91
Average                                       0.88  0.89  0.88  0.88
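
As a rough illustration of the two ingredients discussed in this section, the sketch below builds a bag-of-words histogram for a page and tests a URL against the course pattern. It is plain Python rather than kLog, the function names are hypothetical, the example URLs are made up, and the pattern is an assumed reading of the regular expression quoted in the text ("cs", optionally followed by "e", then digits).

```python
import re
from collections import Counter

# Assumed reading of the pattern quoted in the text.
CS_IN_URL = re.compile(r"cs(e*)[0-9]+")


def word_histogram(words):
    """Bag-of-words with counts: one bin per distinct word."""
    return Counter(words)


def cs_in_url(url):
    """True when the URL contains 'cs' (or 'cse') followed by digits."""
    return CS_IN_URL.search(url) is not None


def histogram_dot(h1, h2):
    """Dot product of two count histograms (a linear bag-of-words kernel)."""
    return sum(count * h2[word] for word, count in h1.items())


page_a = word_histogram(["homework", "lecture", "homework", "exam"])
page_b = word_histogram(["homework", "research", "publications"])
print(page_a["homework"], histogram_dot(page_a, page_b))
print(cs_in_url("http://www.example.edu/courses/cse142/"))
```

The dot product is included only to show why histograms are a convenient feature representation; it is not the soft match kernel itself.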

5.4. Domains with a single interpretation

The Internet Movie Database (IMDb) collects information about movies
and their cast, people, and companies working in the motion picture industry.
We focus on predicting, for each movie, whether its first weekend box-office
receipts are over US$2 million, a learning task previously defined in [27, 65].
The learning setting defined so far (learning from independent interpretations)
is not directly applicable since train and test data must necessarily
occur within the same interpretation. The notion of slicing in kLog allows
us to overcome this difficulty. A slice system is a partition of the true ground
atoms in a given interpretation: $z = \{z(i_1), \ldots, z(i_n)\}$, where the disjoint sets
$z(i_j)$ are called slices and the index set $I = \{i_1, \ldots, i_n\}$ is endowed with a
total order $\preceq$. For example, a natural choice for $I$ in the IMDb domain is
the set of movie production years (e.g. $\{1996, \ldots, 2005\}$), where the index
associated with a ground atom of an entity such as actor is the debut year.

In this way, given two disjoint subsets of $I$, $T$ and $S$, such that
$\max_\preceq(T) \prec \min_\preceq(S)$, it is reasonable during training to use, for some index
$t \in T \setminus \{\min_\preceq(T)\}$, the set of ground atoms
$\{x(i) : i \in T \wedge i \preceq t\} \cup \{y(i) : i \in T \wedge i \prec t\}$
(where $i \prec t$ iff $i \preceq t$ and $i \neq t$) as the input portion of the data,
and $\{y(t)\}$ as the output portion (targets). Similarly, during testing we can
for each $s \in S$ use the set of ground atoms
$\{x(i) : i \in S \wedge i \preceq s\} \cup \{y(i) : i \in S \wedge i \prec s\}$
for predicting $\{y(s)\}$.
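
The construction of one training case from a slice system can be sketched as follows. This is an illustration, not kLog code: slices are represented as Python dictionaries from index to sets of ground atoms, the function name and the toy atoms are hypothetical, and the sketch assumes the reading in which input atoms x are visible up to and including index t while output atoms y are visible only strictly before t.

```python
def training_case(x, y, T, t):
    """Return (inputs, targets) for index t of the training index set T.

    x, y: dicts mapping a slice index to a set of ground atoms
          (input atoms and output atoms respectively).
    Assumption: x(i) is used for i <= t, y(i) only for i < t.
    """
    if t not in T or t == min(T):
        raise ValueError("t must belong to T and differ from its minimum")
    inputs = set()
    for i in T:
        if i <= t:
            inputs |= x.get(i, set())
        if i < t:
            inputs |= y.get(i, set())
    return inputs, set(y.get(t, set()))


# Toy slices indexed by production year (atoms are made up).
x = {1996: {"movie(m1)"}, 1997: {"movie(m2)"}, 1998: {"movie(m3)"}}
y = {1996: {"blockbuster(m1)"}, 1997: {"blockbuster(m2)"}, 1998: set()}
inputs, targets = training_case(x, y, T=[1996, 1997], t=1997)
print(sorted(inputs), sorted(targets))
```

Excluding the minimum of T simply guarantees that at least one earlier slice with known outputs is available as context for the case being built.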

The kLog data set was created after downloading the whole database
from http://www.imdb.com. Adult movies, movies produced outside the
US, and movies with no opening weekend data were discarded. Persons
and companies with a single appearance in this subset of movies were also
discarded. The resulting data set is summarized in Table 5. We modeled
the domain in kLog using extensional signatures for movies, persons (actors,
producers, directors), and companies (distributors, production companies,
special effects companies). We additionally included intensional signatures
counting, for each movie, the number of companies also involved in other
blockbuster movies. We sliced the data set according to production year
and, starting from year $y = 1997$, we trained on the frame $\{y - 1, y - 2\}$ and
tested on the frame $\{y\}$. Results (area under the ROC curve) are summarized
in Table 5; they are on par with relational learners based on collective
classification [27].
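
The sliding train/test frames, and the aggregate reported in the last row of Table 5, can be reproduced with a few lines of Python. The per-year AUROC values are copied from Table 5; the function name is hypothetical, and treating the "±" term as a sample standard deviation over the nine test years is an assumption of this sketch.

```python
from statistics import mean, stdev


def frames(first_test_year, last_test_year):
    """Yield ((y - 2, y - 1), y): two training years, then the test year."""
    for y in range(first_test_year, last_test_year + 1):
        yield (y - 2, y - 1), y


# Per-year AUROC values as reported in Table 5.
auroc = {1997: 0.86, 1998: 0.93, 1999: 0.89, 2000: 0.96, 2001: 0.95,
         2002: 0.93, 2003: 0.95, 2004: 0.95, 2005: 0.92}

splits = list(frames(1997, 2005))
values = [auroc[test_year] for _, test_year in splits]
summary = (round(mean(values), 2), round(stdev(values), 2))
print(splits[0], summary)
```

With these frames, 1995 and 1996 appear only on the training side, which matches the caption of Table 5.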

6. Related work 

As kLog is a language for logical and relational learning with kernels it 
is related to work on inductive logic programming, to statistical relational 
learning, to graph kernels and to propositionalization. We now discuss each 
of these lines of work and their relation to kLog. 

First, the underlying representation of the data that kLog employs at
the first level is very close to that of standard inductive logic programming
systems such as Progol [66], Aleph [67], and Tilde [68], in the sense that the
input is essentially (a variation of) a Prolog program for specifying the data
and the background knowledge. Prolog allows us to encode essentially any
program as background knowledge. The E/R model used in kLog is related to
the Probabilistic Entity Relationship models introduced by Heckerman et al.
in [69]. The signatures play a role similar to that of a declarative bias
in inductive logic programming [5]. The combined use of the E/R model and
graphicalization has provided us with a powerful tool for visualizing both
the structure of the data (the E/R diagram) and specific cases (through
their graphs). This has proven to be very helpful when preparing data sets
for kLog. On the other hand, due to the adoption of a database framework,
kLog does not allow functors to be used in the signature relations (though
functors can be used inside the background-knowledge predicates needed to
compute these relations). This contrasts with some inductive logic
programming systems such as Progol [66] and Aleph [67].

Table 5: Results on the IMDb data set. Years 1995 and 1996 were only used for training.

Year   # Movies   # Facts   AUROC
1995         74      2483
1996        223      6406
1997        311      8031   0.86
1998        332      7822   0.93
1999        348      7842   0.89
2000        381      8531   0.96
2001        363      8443   0.95
2002        370      8691   0.93
2003        343      7626   0.95
2004        371      8850   0.95
2005        388      9093   0.92
All                         0.93 ± 0.03

Second, kLog is related to many existing statistical relational learning
systems such as Markov Logic [13], Probabilistic Similarity Logic [70],
Probabilistic Relational Models [71], Bayesian Logic Programs [72] and ProbLog
[73] in that the representations of the inputs and outputs are essentially
the same, that is, both in kLog and in statistical relational learning systems
inputs are partial interpretations which are completed by predictions.
What kLog and statistical relational learning techniques have in common is
that they both construct (implicitly or explicitly) graphs representing the
instances. For statistical relational learning methods such as Markov Logic [13],
Probabilistic Relational Models [71], and Bayesian Logic Programs [72], the
knowledge-based model construction process results in a graphical model
(Bayesian or Markov network) for each instance, representing a class of
probability distributions, while in kLog the process of graphicalization results in
a graph representing an instance by unrolling the E/R diagram. Statistical
relational learning systems then learn a probability distribution using the
features and parameters in the graphical model, while kLog learns a function
using the features derived by the kernel from the graphs. A difference
between these statistical relational learning models and kLog is that the
former do not really have a second level as kLog does. Indeed, the
knowledge-based model construction process directly generates the graphical model that
includes all the features used for learning, while in kLog these features are
derived from the graph kernel. While statistical relational learning systems
have been commonly used for collective learning, this is still a question for
further research within kLog. A combination of structured-output learning
[6] and iterative approaches (as incorporated in the EM algorithm) can form
the basis for further work in this direction.

kLog also builds upon the many results on learning with graph kernels;
see [74] for an overview. A distinguishing feature of kLog is, however, that
the graphs obtained by graphicalizing a relational representation contain very
rich labels, which can be both symbolic and numeric. This contrasts with
the kind of graphs needed to represent, for instance, small molecules. In this
regard, kLog is close in spirit to the work of [75], who define a kernel on
hypergraphs, where hypergraphs are used to represent relational interpretations.
A further key difference, however, is that kLog also provides a procedure
for automatically graphicalizing relational representations, which allows
multitask and collective learning tasks to be specified naturally.

The graphicalization approach introduced in kLog is closely related to
the notion of propositionalization, a commonly applied technique in logical
and relational learning [76, 5] to generate features from a relational
representation. The advantage of graphicalization is that the obtained graphs are
essentially equivalent to the relational representation and that, in contrast
to the existing propositionalization approaches in logical and relational
learning, this does not result in a loss of information. After graphicalization, any
graph kernel can in principle be applied to the resulting graphs. Even though
many of these kernels (such as the one used in kLog) compute, implicitly
or explicitly, a feature vector, the dimensionality of the obtained vector
is far beyond that employed by traditional propositionalization approaches.
kFOIL [58] is one such propositionalization technique that has been tightly
integrated with a kernel-based method. It greedily derives a (small) set of
features in a way that resembles the rule-learning algorithm of FOIL [77].

Other domain-specific languages for machine learning have been developed
whose goals are closely related to those of kLog. Learning Based
Java [64] was designed specifically to address applications in natural
language processing. It builds on the concept of data-driven compilation to
perform feature extraction and nicely exploits the constrained conditional
model framework [78] for structured output learning. FACTORIE [79] allows
the features used in a factor graph, and consequently arbitrarily connected
conditional random fields, to be defined concisely. As with MLNs, there is an
immediate dependency of the feature space on the sentences of the language,
whereas in kLog this dependency is indirect since the exact feature space is
eventually defined by the graph kernel.

7. Conclusions 

We have introduced a novel language for logical and relational learning 
called kLog. It tightly integrates logical and relational learning with ker- 
nel methods and constitutes a principled framework for statistical relational 
learning based on kernel methods rather than on graphical models. kLog 
uses a representation that is based on E/R modeling, which is close to repre- 
sentations being used by contemporary statistical relational learners. kLog 
first performs graphicalization, that is, it computes a set of labeled graphs 
that are equivalent to the original representation, and then employs a graph 
kernel to realize statistical learning. We have shown that the kLog frame- 
work can be used to formulate and address a wide range of learning tasks, 
that it performs at least comparably to state-of-the-art statistical relational 
learning techniques, and also that it can be used as a programming language 
for machine learning. 

There are several interesting questions for further research. 

The graph kernel that is currently employed in kLog makes use of the
notion of topological distances to define the concept of neighborhoods. In this
way, given a predicate of interest, properties of "nearby" tuples are combined
to generate features relevant for that predicate. As a consequence, when
topological distances are not informative (e.g. in the case of dense graphs
with small diameter), large fractions of the graph become accessible to
any neighborhood and the features induced for a specific predicate cease to
be discriminative. In these cases (typical when dealing with small-world
networks), kernels with a different type of bias (e.g. flow-based kernels) are more
appropriate. The implementation of a library of kernels suitable for different
types of graphs is therefore an important direction for future development.

While kLog's semantics naturally allows structured output learning
problems to be defined, procedures for jointly optimizing the predictions in a
collective classification setting are not available at present.

Furthermore, even though kLog's current implementation performs quite
well, there are interesting implementation issues to be studied. Many of
these are similar to those encountered in statistical relational learning systems
such as Alchemy [13] and ProbLog [80].

kLog is being actively used for developing applications. We are currently
exploring applications of kLog in natural language processing [81, 82, 83] and
in computer vision.

Appendix A. Syntax of the kLog domain declaration section

A kLog program consists of Prolog code augmented by a domain declaration
section delimited by the pair of keywords begin_domain and end_domain
and containing one or more signature declarations. A signature declaration
consists of a signature header followed by one or more Prolog clauses. Clauses in a
signature declaration form the declaration of signature predicates and are
automatically connected to the current signature header. There are a few
signature predicates with a special meaning for kLog, as discussed in this
section. A brief BNF description of the grammar of kLog domains is given
in Figure A.10.

    domain          ::= begin_domain. signatures end_domain.
    signatures      ::= signature | [signature signatures]
    signature       ::= header [sig_clauses]
    sig_clauses     ::= sig_clause | [sig_clause sig_clauses]
    sig_clause      ::= Prolog_clause
    header          ::= sig_name ( args ) :: level .
    sig_name        ::= Prolog_atom
    args            ::= arg | [arg args]
    arg             ::= column_name [role_overrider] :: type
    column_name     ::= Prolog_atom
    role_overrider  ::= @ role
    role            ::= Prolog_atom
    type            ::= self | sig_name
    level           ::= intensional | extensional

Figure A.10: kLog syntax
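
To make the header rule of the grammar concrete, the following sketch parses a single signature header into its name, arguments, and level. It is a hypothetical helper written in Python, not part of kLog. Following the scripts in Appendix B rather than the figure, it assumes that arguments are separated by commas, that the header may be preceded by the keyword signature, and that any Prolog atom (for example property) is accepted as a type.

```python
import re

HEADER = re.compile(
    r"^\s*(?:signature\s+)?(?P<name>[a-z]\w*)\s*\((?P<args>[^()]*)\)\s*::\s*"
    r"(?P<level>intensional|extensional)\s*\.\s*$"
)
ARG = re.compile(
    r"^(?P<column>[a-z]\w*)(?:@(?P<role>[a-z]\w*))?::(?P<type>[a-z]\w*)$"
)


def parse_header(text):
    """Return (sig_name, [(column_name, role_or_None, type), ...], level)."""
    m = HEADER.match(text)
    if m is None:
        raise ValueError("not a signature header: %r" % text)
    args = []
    for raw in m.group("args").split(","):
        a = ARG.match("".join(raw.split()))  # drop whitespace inside the argument
        if a is None:
            raise ValueError("malformed argument: %r" % raw)
        args.append((a.group("column"), a.group("role"), a.group("type")))
    return m.group("name"), args, m.group("level")


parsed = parse_header(
    "signature bnd(atom_1@b::atm, atom_2@b::atm, type::property)::intensional."
)
print(parsed)
```

The role is returned as None when no role overrider is present, which mirrors the optional role_overrider in the arg rule.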

Additionally kLog provides a library of Prolog predicates for handling 
data, learning, and performance measurement. 

Appendix B. Fragments of kLog scripts 

Appendix B.1. Domain specification for UWCSE

    begin_domain.
    signature has_position(p::professor, position::property)::extensional.
    signature in_phase(s::student, phase::property)::extensional.
    signature years_in_program(s::student, years::property)::extensional.
    signature advised_by(s::student, p::professor)::extensional.

    signature student(s::self)::extensional.
    signature professor(p::self)::extensional.

    signature on_same_course(s::student, p::professor)::intensional.
    on_same_course(S,P) :-
        professor(P), student(S),
        ta(Course,S,Term), taught_by(Course,P,Term).

    signature on_same_500_course(s::student, p::professor)::intensional.
    on_same_500_course(S,P) :-
        professor(P), student(S),
        ta(Course,S,Term), taught_by(Course,P,Term),
        course_level(Course,level_500).

    signature on_same_paper(s::student, p::professor)::intensional.
    on_same_paper(S,P) :-
        student(S), professor(P),
        publication(Pub,S), publication(Pub,P).
    end_domain.



Appendix B.2. Domain specification for the Bursi task 



    begin_domain.

    signature atm(atom_id::self, element::property)::intensional.
    atm(Atom,Element) :- a(Atom,Element), \+(Element=h).

    signature bnd(atom_1@b::atm, atom_2@b::atm, type::property)::intensional.
    bnd(Atom1,Atom2,Type) :-
        b(Atom1,Atom2,NType), describeBondType(NType,Type),
        atm(Atom1,_), atm(Atom2,_).

    signature fgroup(fgroup_id::self, group_type::property)::intensional.
    fgroup(Fg,Type) :- sub(Fg,Type,_).

    signature fgmember(fg::fgroup, atom::atm)::intensional.
    fgmember(Fg,Atom) :- subat(Fg,Atom,_), atm(Atom,_).

    signature fg_fused(fg1@nil::fgroup, fg2@nil::fgroup, nrAtoms::property)::intensional.
    fg_fused(Fg1,Fg2,NrAtoms) :- fused(Fg1,Fg2,AtomList), length(AtomList,NrAtoms).

    signature fg_connected(fg1@nil::fgroup, fg2@nil::fgroup,
                           bondType::property)::intensional.
    fg_connected(Fg1,Fg2,BondType) :-
        connected(Fg1,Fg2,Type,_AtomList), describeBondType(Type,BondType).

    signature fg_linked(fg::fgroup, alichain::fgroup, saturation::property)::intensional.
    fg_linked(FG,AliChain,Sat) :-
        linked(AliChain,Links,_BranchesEnds,Saturation),
        ( Saturation = saturated -> Sat = saturated ; Sat = unsaturated ),
        member(link(FG,_A1,_A2),Links).

    signature mutagenic::extensional.
    end_domain.



Above is a kLog script for the Bursi data set used in Section 5.1. Signature
bnd(atom_1@b::atm,atom_2@b::atm) contains the same role field b twice,
declaring that the two atoms play the same role in the chemical bond and
that the relation is symmetric. In this way, each bond can be represented by
one tuple only, while a more traditional relational representation, which is
directional, would require two tuples. While this may at first sight appear to
be only syntactic sugar, it does provide extra modeling abilities, which are
important in some domains. For instance, when modeling ring structures in
molecules, traditional logical and relational learning systems need either to
employ lists to capture all the elements in a ring structure, or to
include all permutations of the atoms participating in a ring structure. For
rings involving 6 atoms, this requires 6! = 720 different tuples, an unnecessary
blow-up. Also, working with lists typically leads to complications such as
having to deal with a potentially infinite number of terms.
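
The counting argument can be checked with a short Python sketch. It uses an unordered set as an analogy for arguments that share the same role; this is only an illustration of the idea, not a description of how kLog stores tuples, and the function name and atom identifiers are made up.

```python
from itertools import permutations


def bond(atom_1, atom_2, bond_type):
    """A symmetric bond: the two atoms share a role, so their order is irrelevant."""
    return (frozenset((atom_1, atom_2)), bond_type)


# One tuple covers both directions of the bond.
same = bond("a1", "a2", "single") == bond("a2", "a1", "single")

# A 6-atom ring: ordered argument tuples versus a single unordered one.
ring = ["c1", "c2", "c3", "c4", "c5", "c6"]
ordered_tuples = set(permutations(ring))
unordered_tuples = {frozenset(p) for p in ordered_tuples}
print(same, len(ordered_tuples), len(unordered_tuples))
```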



Appendix C. Definitions 

For the sake of completeness we report here a number of graph-theoretical
definitions used in the paper. We closely follow the notation in [84]. A graph
$G = (V, E)$ consists of two sets $V$ and $E$. The notation $V(G)$ and $E(G)$
is used when $G$ is not the only graph considered. The elements of $V$ are
called vertices and the elements of $E$ are called edges. Each edge has a
set of two elements in $V$ associated with it, which are called its endpoints
and which we denote by concatenating the vertex variables, e.g. we represent
the edge between the vertices $u$ and $v$ with $uv$. An edge is said to join its
endpoints. A vertex $v$ is adjacent to a vertex $u$ if they are joined by an
edge. An edge and a vertex on that edge are called incident. The degree of
a vertex is the number of edges incident to it. A multi-edge is a collection of two
or more edges having identical endpoints. A self-loop is an edge that joins a
single endpoint to itself. A simple graph is a graph that has neither self-loops nor
multi-edges. A graph is bipartite if its vertex set can be partitioned into two
subsets $X$ and $Y$ so that every edge has one end in $X$ and the other in $Y$. We
denote a bipartite graph $G$ with partition $(X, Y)$ by $G([X, Y], E)$. A graph is
rooted when we distinguish one of its vertices, called the root; we denote a rooted
graph $G$ with root vertex $v$ with $G^v$. A walk in a graph $G$ is a sequence
of vertices $W = v_0, v_1, \ldots, v_n$ such that for $j = 1, \ldots, n$, the vertices $v_{j-1}$
and $v_j$ are adjacent. The length of a walk is the number of edges (counting
repetitions). A path is a walk such that no vertex is repeated, except at most
the initial ($v_0$) and the final ($v_n$) vertex (in this case it is called a cycle). The
distance between two vertices, denoted $\mathcal{D}(u, v)$, is the length of the shortest
path between them. A graph is connected if between each pair of vertices
there exists a walk. We denote the class of simple connected graphs with
$\mathcal{G}$. The neighborhood of a vertex $v$ is the set of vertices that are adjacent
to $v$ and is indicated with $N(v)$. The neighborhood of radius $r$ of a vertex
$v$ is the set of vertices at a distance less than or equal to $r$ from $v$ and is
denoted by $N_r(v)$. In a graph $G$, the induced subgraph on a set of vertices
$W = \{w_1, \ldots, w_k\}$ is a graph that has $W$ as its vertex set and contains every
edge of $G$ whose endpoints are in $W$. A subgraph $H$ is a spanning subgraph
of a graph $G$ if $V(H) = V(G)$. The neighborhood subgraph of radius $r$ of
vertex $v$ is the subgraph induced by the neighborhood of radius $r$ of $v$ and
is denoted by $\mathcal{N}_r^v$. A labeled graph is a graph whose vertices and/or edges
are labeled, possibly with repetitions, using symbols from a finite alphabet.
We denote the function that maps a vertex/edge to its label symbol as $\mathcal{L}$.
Two simple graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$ are isomorphic, which
we denote by $G_1 \simeq G_2$, if there is a bijection $\phi : V_1 \to V_2$ such that for
any two vertices $u, v \in V_1$, there is an edge $uv$ if and only if there is an edge
$\phi(u)\phi(v)$ in $G_2$. An isomorphism is a structure-preserving bijection. Two
labeled graphs are isomorphic if there is an isomorphism that also preserves
the label information, i.e. $\mathcal{L}(\phi(v)) = \mathcal{L}(v)$. An isomorphism invariant or graph
invariant is a graph property that is identical for two isomorphic graphs (e.g.
the number of vertices and/or edges). A certificate for isomorphism is an
isomorphism invariant that is identical for two graphs if and only if they are
isomorphic.
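
Several of these definitions (distance, neighborhood of radius r, induced subgraph, neighborhood subgraph) translate directly into code. The sketch below is a minimal illustration for simple undirected graphs stored as adjacency dictionaries; the function names and the toy graph are hypothetical and are not taken from the kLog implementation.

```python
from collections import deque


def distances_from(adj, root):
    """Breadth-first search: shortest-path distance from root to each reachable vertex."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def neighborhood(adj, v, r):
    """Vertices at distance at most r from v (v itself is at distance 0)."""
    return {u for u, d in distances_from(adj, v).items() if d <= r}


def induced_subgraph(adj, vertices):
    """Keep the given vertices and every edge whose endpoints are both kept."""
    return {u: {w for w in adj[u] if w in vertices} for u in vertices}


def neighborhood_subgraph(adj, v, r):
    """Subgraph induced by the neighborhood of radius r of v."""
    return induced_subgraph(adj, neighborhood(adj, v, r))


# Toy graph: path a - b - c - d with an extra edge b - e.
adj = {"a": {"b"}, "b": {"a", "c", "e"}, "c": {"b", "d"}, "d": {"c"}, "e": {"b"}}
print(sorted(neighborhood(adj, "a", 2)))
```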

A hypergraph is a generalization of a graph also known under the name
of set system. A set system is an ordered pair $(V, \mathcal{F})$ where $V$ is a set of
elements and $\mathcal{F}$ is a family of subsets of $V$. Note that when $\mathcal{F}$ is made of
pairs of elements of $V$ then $(V, \mathcal{F})$ is a simple graph. The elements of $V$
are called vertices of the hypergraph and the elements of $\mathcal{F}$ the hyperedges.
There are two principal ways to represent set systems as graphs: as incidence
graphs and as intersection graphs. In the following we consider only incidence
graphs. Given a set system $H = (V, \mathcal{F})$, the associated incidence graph is the
bipartite graph $G([V, \mathcal{F}], E)$ where $v \in V$ and $F \in \mathcal{F}$ are adjacent if $v \in F$.
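
The incidence graph of a set system is easy to build explicitly. In the sketch below, which is only an illustration, hyperedges are given as a dictionary from a hypothetical hyperedge name to its set of vertices, and the bipartite graph is returned as a set of (vertex, hyperedge name) edges.

```python
def incidence_graph(vertices, hyperedges):
    """Edges (v, F) of the bipartite incidence graph: v and F are adjacent iff v is in F."""
    vertices = set(vertices)
    for name, members in hyperedges.items():
        if not set(members) <= vertices:
            raise ValueError("hyperedge %r contains unknown vertices" % name)
    return {(v, name) for name, members in hyperedges.items() for v in members}


V = {"v1", "v2", "v3", "v4"}
F = {"F1": {"v1", "v2", "v3"}, "F2": {"v3", "v4"}}
edges = incidence_graph(V, F)
degree_v3 = sum(1 for v, _ in edges if v == "v3")
print(sorted(edges), degree_v3)
```

Every edge joins an element of V to an element of the family, so the graph is bipartite by construction; a hyperedge with k vertices becomes a vertex of degree k.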

Appendix D. Decomposition Kernels 



We follow the notation in [85]. Given a set $\mathcal{X}$ and a function
$K : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, we say that $K$ is a kernel on $\mathcal{X} \times \mathcal{X}$ if $K$ is symmetric,
i.e. if for any $x$ and $y \in \mathcal{X}$, $K(x, y) = K(y, x)$, and if $K$ is positive-semidefinite, i.e. if for any
$N \geq 1$ and any $x_1, \ldots, x_N \in \mathcal{X}$, the matrix $K$ defined by $K_{ij} = K(x_i, x_j)$
is positive-semidefinite, that is $\sum_{ij} c_i c_j K_{ij} \geq 0$ for all $c_1, \ldots, c_N \in \mathbb{R}$ or,
equivalently, if all its eigenvalues are nonnegative. It is easy to see that if
each $x \in \mathcal{X}$ can be represented as $\phi(x) = \{\phi_n(x)\}_{n \geq 1}$ such that $K$ is the
ordinary $\ell_2$ dot product $K(x, y) = \langle \phi(x), \phi(y) \rangle = \sum_n \phi_n(x) \phi_n(y)$ then $K$
is a kernel. The converse is also true under reasonable assumptions (which
are almost always verified) on $\mathcal{X}$ and $K$, that is, a given kernel $K$ can be
represented as $K(x, y) = \langle \phi(x), \phi(y) \rangle$ for some choice of $\phi$. In particular it
holds for any kernel $K$ over $\mathcal{X} \times \mathcal{X}$ where $\mathcal{X}$ is a countable set. The vector
space induced by $\phi$ is called the feature space. Note that it follows from the
definition of positive-semidefinite that the zero-extension of a kernel is a valid
kernel, that is, if $S \subseteq \mathcal{X}$ and $K$ is a kernel on $S \times S$ then $K$ may be extended
to be a kernel on $\mathcal{X} \times \mathcal{X}$ by defining $K(x, y) = 0$ if $x$ or $y$ is not in $S$. It is
easy to show that kernels are closed under summation, i.e. a sum of kernels
is a valid kernel.
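
These properties can be checked numerically on a small example. The sketch below is illustrative only: the feature map is a hypothetical choice, and it builds a kernel as a dot product in feature space, forms a Gram matrix, evaluates the quadratic form for random coefficient vectors (which should never be negative beyond rounding error), and constructs a zero-extension.

```python
import random


def phi(x):
    """Hypothetical feature map, chosen only for illustration."""
    return (1.0, float(x), float(x) ** 2)


def dot_kernel(feature_map):
    """Kernel defined as the ordinary dot product in feature space."""
    return lambda x, y: sum(a * b for a, b in zip(feature_map(x), feature_map(y)))


def gram(kernel, xs):
    return [[kernel(a, b) for b in xs] for a in xs]


def quadratic_form(K, c):
    n = len(c)
    return sum(c[i] * c[j] * K[i][j] for i in range(n) for j in range(n))


def zero_extension(kernel, S):
    """Extend a kernel defined on S x S by returning 0 outside S."""
    return lambda x, y: kernel(x, y) if x in S and y in S else 0.0


k = dot_kernel(phi)
xs = [-2, -1, 0, 1, 3]
K = gram(k, xs)
K_ext = gram(zero_extension(k, {0, 1}), xs)

rng = random.Random(0)
coeffs = [[rng.uniform(-1, 1) for _ in xs] for _ in range(100)]
smallest = min(quadratic_form(K, c) for c in coeffs)
smallest_ext = min(quadratic_form(K_ext, c) for c in coeffs)
print(smallest >= -1e-9, smallest_ext >= -1e-9)
```

Random sampling does not prove positive-semidefiniteness; it only fails to find a counterexample, which is what the definition predicts for a dot-product kernel and for its zero-extension.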

Let now $x \in \mathcal{X}$ be a composite structure such that we can define $x_1, \ldots, x_D$
as its parts (note that the set of parts need not be a partition of the composite
structure, i.e. the parts may "overlap"). Each part is such that $x_d \in \mathcal{X}_d$ for
$d = 1, \ldots, D$ with $D \geq 1$, where each $\mathcal{X}_d$ is a countable set. Let $R$ be the relation
defined on the set $\mathcal{X}_1 \times \ldots \times \mathcal{X}_D \times \mathcal{X}$, such that $R(x_1, \ldots, x_D, x)$ is true iff
$x_1, \ldots, x_D$ are the parts of $x$. We denote with $R^{-1}(x)$ the inverse relation that
yields the parts of $x$, that is $R^{-1}(x) = \{x_1, \ldots, x_D : R(x_1, \ldots, x_D, x)\}$. In [85] it is
demonstrated that, if there exists a kernel $K_d$ over $\mathcal{X}_d \times \mathcal{X}_d$ for each $d = 1, \ldots, D$, and if
two instances $x, y \in \mathcal{X}$ can be decomposed in $x_1, \ldots, x_D$ and $y_1, \ldots, y_D$, then
the following generalized convolution:

$$K(x, y) = \sum_{\substack{x_1, \ldots, x_D \in R^{-1}(x) \\ y_1, \ldots, y_D \in R^{-1}(y)}} \prod_{d=1}^{D} K_d(x_d, y_d) \qquad \text{(D.1)}$$

is a valid kernel called a convolution or decomposition kernel. In words:
a decomposition kernel is a sum (over all possible ways to decompose a
structured instance) of the product of valid kernels over the parts of the instance.
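
A direct, unoptimized implementation of the generalized convolution helps to see what the double sum does. The sketch below is illustrative only and does not reproduce the kernel used in kLog: the decomposition relation (adjacent character pairs of a string, so that each decomposition has two parts) and the exact-match part kernel are hypothetical choices made to keep the example small.

```python
def decomposition_kernel(decompose, part_kernels):
    """Sum over all pairs of decompositions of the product of part kernels."""
    def K(x, y):
        total = 0.0
        for x_parts in decompose(x):
            for y_parts in decompose(y):
                term = 1.0
                for k_d, x_d, y_d in zip(part_kernels, x_parts, y_parts):
                    term *= k_d(x_d, y_d)
                total += term
        return total
    return K


def delta(a, b):
    """Exact-match kernel on a countable set."""
    return 1.0 if a == b else 0.0


def adjacent_pairs(s):
    """Hypothetical relation R: the parts of s are (s[i], s[i+1]) for each position i."""
    return [(s[i], s[i + 1]) for i in range(len(s) - 1)]


K = decomposition_kernel(adjacent_pairs, [delta, delta])
print(K("abab", "ab"), K("abab", "abab"))
```

With these choices the kernel counts pairs of matching adjacent character pairs, so overlapping parts are counted separately, as allowed by the definition.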
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